EUGene: A Conceptual Manual
D. Scott Bennett                      Allan C. Stam
Department of Political Science       Department of Political Science
The Pennsylvania State University     Yale University
107 Burrowes Building                 124 Prospect St.
University Park, PA 16802-6200        New Haven, CT 06520-8301
Tel: 814-865-6566                     Tel: 203-432-6220
Fax: 814-863-8979                     Fax: 203-432-6196
Email: mail:
International Interactions 26:179-204. (forthcoming)
Abstract: The study of international relations using quantitative analysis relies, in part, on the availability of comprehensive and easily manipulable data sets. To execute large-n statistical tests of hypotheses, data must be available on the variables of interest, and those data must be manipulated into a suitable format to allow the inclusion of appropriate control variables as well as variables of central theoretical interest. This paper introduces software designed to eliminate many of the difficulties commonly involved in constructing large international relations data sets, and to address the unavailability of data on expected utility theories of war. To solve these two problems, we developed EUGene (the Expected Utility Generation and Data Management Program). EUGene is a stand-alone Microsoft Windows-based program for the construction of annual data sets for use in quantitative studies of international relations. It generates the data needed to incorporate key variables from implementations of the so-called “expected utility theory of war” into broader analyses of international conflict. EUGene is also designed to make building international relations data sets simple. It accomplishes this by automating a variety of tasks necessary to integrate several data building blocks commonly used in tests of international relations theories.
Key words: Expected Utility; Software; International Relations; Statistical Analysis
Note: The order of the authors’ names is alphabetical and is not intended to connote principal authorship. This material is based upon work supported by the National Science Foundation under grants SBR-9601151, SES-9975115, and SBR-9975291. We would like to thank Bruce Bueno de Mesquita for his invaluable assistance in generating expected utility data.
INTRODUCTION
The study of international relations using quantitative analysis relies on the availability of comprehensive and easily manipulable data sets. While seemingly trivial, this statement highlights one of the many technical hurdles IR scholars face. To execute large-n statistical tests of hypotheses, data must be available on the variables of interest, and those data must be manipulated into a suitable format to allow the inclusion of appropriate control variables as well as variables of central theoretical interest. Frequently, however, the process of preparing data sets for analysis is cumbersome, particularly for data sets with many cases and with variables that come from a variety of sources. Control variables are often excluded from analysis, not for theoretical or statistical reasons, but simply because cumbersome data manipulation tasks preclude optimal test design. The somewhat daunting task of preparing large data sets can turn scholars into technicians for substantial periods of time, diverting them from theory development and research design improvements. Alternatively, for many scholars the barrier to entry into the realm of quantitative research is sufficiently high to preclude any sophisticated analysis at all.
One of the most important theories for which data have not been available for testing, either as a theory of interest or as an essential control variable, is the so-called expected utility theory of war (Bueno de Mesquita, 1981, 1985; Bueno de Mesquita and Lalman, 1992). Although it is one of the most important current theories of international conflict, Bueno de Mesquita’s variant of expected utility theory has until recently been tested only on a small set of cases.[1] This is because expected utility data have not been available for all dyads and over the full time span of most other international relations data sets (1816 through the 1980s or 1990s, depending on the data set). Replication of Bueno de Mesquita’s data has been slow, in part because of substantial barriers inherent in replicating data construction algorithms and finding adequate computational capacity for creating these data. Perhaps just as importantly, none of the tests of other theories (democratic peace, power transition, arms races, etc.) has been able to include controls for the measures that Bueno de Mesquita claims to be powerful predictors of interstate conflict. Several of these measures may be correlated with the other variables of interest, potentially introducing substantial omitted variable bias in much of the recent quantitative work on war and dispute initiation.
This paper introduces software designed to eliminate many of the difficulties commonly involved in constructing large international relations data sets, and to make available complete data on Bueno de Mesquita and Lalman’s most recent version of the so-called expected utility theory of war. We have developed EUGene (the Expected Utility Generation and Data Management Program) to solve these two problems. EUGene is a stand-alone Microsoft Windows-based program for the construction of annual data sets for use in quantitative studies of international relations. It generates the data needed to incorporate key variables from Bueno de Mesquita and Lalman’s work (1992) into broader analyses of international conflict. EUGene’s purpose is to make the construction of international relations data sets simple. It accomplishes this by automating a variety of tasks necessary to integrate several data building blocks commonly used in tests of international relations theories.
Users of the program simply specify the type of data set they would like to create by selecting from a series of drop-down menus. Choices include the unit of analysis, population of cases, variables to include, and output format. The program assembles a data set according to these user specifications and outputs it for analysis in other statistical software packages. The program also creates command files that make reading the data into other statistical programs automatic. Users of EUGene do not need to be able to write a single line of computer code in order to merge data, read data from input files of varying formats, or convert data into common units of analysis. By reducing the time necessary to carry out routine data set construction tasks, EUGene allows users to proceed more rapidly to the analysis stage, and allows scholars to spend more time on theory development and on asking new research questions than on data management. Below, we describe EUGene, its functions and operation, and key details of its data generation algorithms. EUGene is available as freeware from the American Political Science Association Conflict Processes Section’s home page at
SIMPLIFYING DATA MANAGEMENT
The first of EUGene’s two major goals is to facilitate analysis in international relations by minimizing the relatively unproductive time scholars must spend on routine tasks of data set construction and manipulation. An obvious but critically important first step in any quantitative data analysis is to create an accurate and appropriate data set. Such data set construction is frequently an onerous and time-consuming task. EUGene serves as a tool to facilitate and simplify the process of merging and creating data sets in international relations, especially data sets created with the directed dyad-year as the unit of analysis. Scholars have increasingly come to use data sets based on the dyad-year to conduct quantitative analyses. This is because dyadic interaction lies at the heart of strategic international behavior, and because it is possible to combine explanations from multiple levels of analysis in one quantitative study. In particular, directed dyad-year studies can include variables from the individual level (e.g., polity type), dyadic level (e.g., balance of forces, distance), and system level (e.g., polarity, system uncertainty) in studies of conflict (see, e.g., Bremer, 1992; Huth, Bennett, and Gelpi, 1992; Huth, Gelpi, and Bennett, 1993; Maoz and Russett, 1993; Oneal and Russett, 1997; Beck, Katz and Tucker, 1997; Ray and Wang, 1998).[2] Most scholars rely on annual data both because data are widely available at this level of temporal aggregation, and because the year represents a natural political break due to budget cycles, electoral cycles, and the presence of winter that in many areas hampers military action.[3]
Unfortunately, several aspects of creating dyadic data sets make the task difficult for many researchers. On the independent variable side, creating dyadic data sets involves merging data and renaming variables, typically from multiple monadic data sets that require conversion to a dyadic form. Even before this step, appropriate sets of dyad-years must be created, a task requiring users to create dyads from lists of states while verifying that both members of the dyad are indeed system members when they are coded as dyads. Given that there are over 1.2 million dyad-years in the international system between 1816 and 1997, this is a task that must be automated, and must be done accurately. On the dependent variable side, the most common data sets with international conflict events (the Correlates of War Militarized Interstate Dispute data set and Interstate War data set) have not been organized in dyadic form and must be converted into dyadic interactions. The necessary merges and conversions are not always straightforward: they require users to make important coding decisions that sometimes go unrecognized, and they serve as a significant barrier to wider theory development, theory testing, and graduate student training.
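The dyad-year construction step described above can be illustrated with a minimal Python sketch. It is not EUGene's own code: the state codes, the membership years, and the names MEMBERSHIP, is_member, and directed_dyad_years are all hypothetical, and each state is assumed to have a single continuous membership spell, whereas actual system membership lists may contain several spells per state.

```python
from itertools import permutations

# Hypothetical system-membership spells: state code -> (first year, last year).
# Codes and years are illustrative only, not actual Correlates of War data.
MEMBERSHIP = {
    "AAA": (1816, 1997),
    "BBB": (1900, 1945),
    "CCC": (1940, 1997),
}


def is_member(state, year, membership=MEMBERSHIP):
    """Return True if the state is a system member in the given year."""
    first, last = membership[state]
    return first <= year <= last


def directed_dyad_years(first_year, last_year, membership=MEMBERSHIP):
    """Yield (state_a, state_b, year) for every ordered pair of distinct
    states in which both states are system members in that year."""
    for year in range(first_year, last_year + 1):
        members = sorted(s for s in membership if is_member(s, year, membership))
        # Ordered pairs: A vs. B and B vs. A are separate directed dyads.
        for a, b in permutations(members, 2):
            yield a, b, year


rows = list(directed_dyad_years(1939, 1941))
```

Because membership is checked year by year, a dyad involving the hypothetical state CCC does not appear before 1940, which is the kind of verification the text describes.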
Because it has been difficult to create quantitative IR data sets that incorporate many variables, scholars have in the past been forced to spend nearly as much time creating their data sets as on theory development and analysis (and sometimes more). This slows the research process and tends to turn researchers into technicians rather than scholars. EUGene was created as a menu-driven tool for Windows to make data set creation easier, with several specific goals in mind that we detail below. EUGene allows users to:
- Construct data sets with different units of analysis, and where appropriate convert input data of varying units of analysis to the selected unit, at the click of a button.
- Choose variables for inclusion in final data sets from a variety of input sources without manually writing code to merge various input data.
- Select subsets of data easily, based on common criteria such as political relevance, time period, or great power status.
- Confront the variety of critical but often unstated assumptions about the construction of key dependent variables and the inclusion of problematic cases that go into the construction of international relations data sets, and make informed decisions about these items.
- Replicate results more readily, because a single program for data set creation produces the same results for all users, eliminating the problem of hidden or forgotten steps typically encountered when attempting replication.
Unit of Analysis
The first choice made by users when creating a new data set in EUGene is the most fundamental of all research design decisions, namely the unit of analysis. EUGene allows users to create data sets with the country-year, directed dyad-year, directed dispute-dyad, or directed dispute-dyad-year as the unit of analysis.[4] By selecting among these units of analysis, users can examine monadic time-series (by creating a country-year data set), examine dispute initiation from a condition of peace or examine the duration of peace (by creating a directed dyad-year data set), examine the escalation of disputes (by creating a directed dispute-dyad data set), or examine the evolution and duration of disputes over time (by creating a directed dispute-dyad-year data set).
Variables
EUGene allows users to specify the variables that are to be included in the output data set by clicking on a set of check boxes. The program as distributed allows users to choose from a set of over 60 variables from several of the most important international relations data sets. In Table 1 we detail the sources of the data EUGene assembles.
------
Table 1 about here
------
The available variables include Polity III democracy scores and ancillary components (Jaggers and Gurr, 1995), Correlates of War project capability data (Singer, Bremer and Stuckey, 1972), data on interstate distances, tau-b scores, risk attitude data, contiguity data, region, peace years (Beck, Katz, and Tucker, 1997), expected utility values and international interaction game equilibrium predictions (Bueno de Mesquita and Lalman, 1992), and COW Militarized Interstate Dispute data (Jones et al., 1996). The ability to include many variables in the output data set easily makes EUGene useful for scholars pursuing a variety of research agendas, not just those exploring applications of expected utility theory. For example, variables are available that figure into the democratic peace theory, power transition theory, balance of power theory, and theories of system structure, to name just a few.
Population of Cases
EUGene allows users to specify the scope of the output data set as being either all dyads, or one of a number of subsets of countries and years. Users can specify a particular range of years for output (e.g. just 1945-1992, or 1816-1914), and can select from common subsets of countries (e.g. all dyads, politically relevant dyads, major power dyads, contiguous states, or a user-selected list such as rivals). Alternatively, users may generate all dyad-years and include variables in the output to allow selection at a later time. Being able to select these commonly-used subsets of cases allows users to conduct comparative analyses and explore the sensitivity of their results to factors like era or region.
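The following Python sketch illustrates how such case selection might look in code. It is an illustration under stated assumptions, not a description of EUGene's internals: the DyadYear record, the in_subset and select_cases functions, and the sample cases are hypothetical. Political relevance is assumed to mean that the dyad is contiguous or contains at least one major power, and a major power dyad is assumed to be one in which both states are major powers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DyadYear:
    """Hypothetical dyad-year record carrying the flags used for selection."""
    state_a: str
    state_b: str
    year: int
    contiguous: bool  # are the two states contiguous?
    major_a: bool     # is state_a a major power in this year?
    major_b: bool     # is state_b a major power in this year?


def in_subset(d, subset):
    """Return True if the dyad-year belongs to the named subset."""
    if subset == "all":
        return True
    if subset == "contiguous":
        return d.contiguous
    if subset == "major_power":
        # Assumption: both members must be major powers.
        return d.major_a and d.major_b
    if subset == "politically_relevant":
        # Assumption: contiguous, or at least one major power in the dyad.
        return d.contiguous or d.major_a or d.major_b
    raise ValueError(f"unknown subset: {subset}")


def select_cases(dyad_years, first_year, last_year, subset="all"):
    """Keep dyad-years inside the year range that belong to the subset."""
    return [d for d in dyad_years
            if first_year <= d.year <= last_year and in_subset(d, subset)]


cases = [
    DyadYear("AAA", "BBB", 1950, True, False, False),
    DyadYear("AAA", "CCC", 1950, False, True, False),
    DyadYear("BBB", "DDD", 1950, False, False, False),
    DyadYear("AAA", "CCC", 1990, False, True, True),
]
relevant = select_cases(cases, 1945, 1992, "politically_relevant")
```

Keeping the flags on each record, rather than filtering at creation time, corresponds to the alternative noted above of generating all dyad-years and selecting cases later.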
Case Inclusion Criteria and Assumptions
A variety of other under-discussed issues also arise in the context of creating dyad-year data sets, with most concerning the inclusion or exclusion of cases in a fashion related to the dependent variable and the censoring of cases. The first issue, mentioned above, is how to deal with years with ongoing militarized disputes. EUGene allows users to code ongoing dispute years as either dispute initiations or non-initiations for purposes of the dependent variable, but also allows users to either drop or include all dyad-years where the countries begin the year with an ongoing dispute (users may want to drop such cases if they believe that a new initiation would be censored by the ongoing dispute).
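The two choices about ongoing dispute years can be expressed as a small decision rule. The Python sketch below is only an illustration: the function code_initiation and its parameter names are hypothetical, and it assumes that a single flag marks both an "ongoing dispute year" and a year that begins with a dispute in progress.

```python
def code_initiation(new_dispute, ongoing_at_start,
                    ongoing_as_initiation=False, drop_ongoing=False):
    """Return the initiation variable for one dyad-year: 1 or 0,
    or None when the dyad-year should be dropped from the data set."""
    if ongoing_at_start and drop_ongoing:
        # The user treats a new initiation as censored by the ongoing dispute.
        return None
    if new_dispute:
        return 1
    if ongoing_at_start and ongoing_as_initiation:
        # The user has chosen to count ongoing dispute years as initiations.
        return 1
    return 0


peaceful_year = code_initiation(new_dispute=False, ongoing_at_start=False)
ongoing_kept = code_initiation(False, True, ongoing_as_initiation=True)
ongoing_dropped = code_initiation(False, True, drop_ongoing=True)
```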
A second issue concerns the treatment of dyads where a state joins a dispute that is already in progress. Should joiners be included for analysis in the same way as dispute initiators? The information conditions faced by joiners are fundamentally different from the conditions facing the initial participants, suggesting that the same model that explains dispute initiation may not explain dispute "joining." It may therefore be appropriate to omit joiners from analysis. EUGene allows users to include or drop such cases by selecting a check box.
A final issue concerns "target vs. initiator" directed dyads. When one state initiates a dispute, it does so against a target state, creating a designated initiator A and target B. But when A initiates vs. B, it is less than clear how to include the directed dyad B vs. A. The problem is that A’s initiation may remove B’s ability to initiate a dispute against A in that same year, and we then do not know how to code the “initiation” variable for the B vs. A dyad because it is censored. EUGene gives users the option to include such target vs. initiator dyads in the data sets it creates, to drop them, or to include them only if there is a subsequent initiation by B vs. A (this would indicate that B in fact did have the opportunity to start a MID against A, and so B’s behavior is not censored).
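The three options for target vs. initiator dyads can be sketched as follows. This Python fragment is a hedged illustration rather than EUGene's algorithm: the function keep_dyad_year, the policy names, and the record layout (initiator, target, year, start day) are hypothetical, and the sketch assumes that a state counts as a target only when the other side's first initiation in that year strictly precedes its own.

```python
def keep_dyad_year(a, b, year, initiations, policy="include"):
    """Decide whether the directed dyad-year 'a vs. b' stays in the data set.

    initiations: iterable of (initiator, target, year, start_day) tuples.
    policy: "include", "drop", or "include_if_subsequent".
    """
    if policy not in ("include", "drop", "include_if_subsequent"):
        raise ValueError(f"unknown policy: {policy}")
    by_b = [d for (i, t, y, d) in initiations if (i, t, y) == (b, a, year)]
    by_a = [d for (i, t, y, d) in initiations if (i, t, y) == (a, b, year)]
    first_b = min(by_b, default=None)
    first_a = min(by_a, default=None)
    # 'a' is a target if b initiated and did so before any initiation by a.
    is_target = first_b is not None and (first_a is None or first_b < first_a)
    if not is_target:
        return True  # not a target vs. initiator case
    if policy == "include":
        return True
    if policy == "drop":
        return False
    # include_if_subsequent: keep only if a later initiated against b,
    # which shows that a's behavior was not censored.
    return first_a is not None


# Hypothetical dispute initiations (state codes and days are illustrative).
initiations = [
    ("AAA", "BBB", 1950, 40),
    ("BBB", "AAA", 1950, 200),
    ("CCC", "DDD", 1960, 10),
]
kept = keep_dyad_year("BBB", "AAA", 1950, initiations, "include_if_subsequent")
dropped = keep_dyad_year("DDD", "CCC", 1960, initiations, "include_if_subsequent")
```

Here the hypothetical state BBB is retained because it initiated against AAA later in 1950, while DDD is excluded because it never initiated against CCC in 1960.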
Merging and Data Conversion
One difficulty with building data sets that combine variables is that input data sets frequently come with different units of analysis and in different formats, requiring conversion at a fundamental level in addition to simply merging. For example, some key IR data sets have the country-year as the unit of analysis (e.g., the Correlates of War national capability data, Gurr Polity data, or data on national risk attitude). Other data sets (or data constructions) have the dyad as the unit of analysis, such as distance data, the Correlates of War contiguity data set, or data on expected utility. Still other data sets are distributed in a hybrid form, such as the Correlates of War Militarized Interstate Dispute data set, which is dyadic and annual in its underlying form but is distributed as three separate files that must be merged together. EUGene carries out necessary conversions among the formats, file structures, and differing units of analysis of these data sets as part of the merging process. For users who wish to merge data themselves, EUGene can also be used to generate a simple set of case identification data (country codes and years) for various subsets of data.
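As an illustration of converting country-year data to the directed dyad-year, the Python sketch below attaches each state's annual attributes to both sides of a dyad. It is a minimal example under stated assumptions, not EUGene's merging routine: the table COUNTRY_YEAR, its values, the variable names cap and democ, and the function to_directed_dyad_year are all hypothetical.

```python
# Hypothetical country-year input: (state, year) -> national attributes.
# Values are illustrative only.
COUNTRY_YEAR = {
    ("AAA", 1950): {"cap": 0.20, "democ": 10},
    ("BBB", 1950): {"cap": 0.05, "democ": 2},
}


def to_directed_dyad_year(dyad_years, country_year):
    """Build one output row per (state_a, state_b, year), copying each
    monadic variable twice: once with suffix _a and once with suffix _b."""
    names = sorted({n for attrs in country_year.values() for n in attrs})
    rows = []
    for a, b, year in dyad_years:
        row = {"state_a": a, "state_b": b, "year": year}
        for suffix, state in (("_a", a), ("_b", b)):
            attrs = country_year.get((state, year), {})
            for name in names:
                # A missing country-year yields None rather than a dropped row.
                row[name + suffix] = attrs.get(name)
        rows.append(row)
    return rows


dyads = [("AAA", "BBB", 1950), ("BBB", "AAA", 1950), ("AAA", "CCC", 1950)]
merged = to_directed_dyad_year(dyads, COUNTRY_YEAR)
```

Renaming the variables with side-specific suffixes is the step the text refers to as renaming during the merge, and recording missing values explicitly keeps every dyad-year in the output so that case selection remains a separate decision.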