Dynamics of Collective Action, 1960-1995
Data Cleaning Procedures
November 9, 2009
Candy Ku, Pat Rafail & Dan Wang
Introduction
This memo details the conventions used to clean and check potentially suspicious or out-of-range values on the variables in this dataset. The cleaning took place in two stages: a preliminary stage, carried out when coders turned in a file of coded protest events, and a final cleaning stage. This memo outlines the procedures used at both stages.
Preliminary Cleaning
After research assistants coded events and turned in a file of data entered in Microsoft Access (a file is one month's worth of events, as discussed in the Coding Manual), a set of procedures was followed to complete a preliminary cleaning of the file. Each Access file was converted into a SAS data file (using StatTransfer), uploaded to the Unix mainframe, and run through a cleaning program. The cleaning program (available upon request) was designed to look for any out-of-range values on numeric variables and then issue an "error log," which was printed and given to the original coder of that file of protest events. The coder would then go back to the original newspaper article, fix any errors found by the program, change the Microsoft Access data entry file, and return it to the PIs, who would then integrate it into the dataset.
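The original cleaning program was written in SAS and is available upon request. The following is only an illustrative Python sketch of the kind of out-of-range check it performed; the input file name, the subset of variables, and their bounds are assumptions for demonstration.

```python
# Illustrative sketch only: the actual preliminary cleaning program was in SAS.
import pandas as pd

# Assumed: one month of coded events exported from the Access file.
events = pd.read_csv("events_196008.csv")

# Hypothetical valid ranges for a few numeric variables.
valid_ranges = {"rptyy": (1960, 1995), "rptmm": (1, 12), "rptdd": (1, 31)}

error_log = []
for var, (lo, hi) in valid_ranges.items():
    bad = events[(events[var] < lo) | (events[var] > hi)]
    for _, row in bad.iterrows():
        error_log.append(
            f"EVENTID {row['EVENTID']}: {var} = {row[var]} outside [{lo}, {hi}]"
        )

# The error log was printed and returned to the original coder for correction.
print("\n".join(error_log))
```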
Final Cleaning and Verification in Detail
While the Preliminary Cleaning stage caught many cleaning-related issues, some problems either were not caught by the cleaning programs or were missed when coders went back to the Access file to make the required corrections. Thus, once the entire 1960-1995 dataset was complete, two teams of research assistants from Pennsylvania State University and Stanford University examined each variable in the dataset and corrected or verified any miscoded or outlying values. The following issues were given particular attention:
1) Extreme, incorrect, or otherwise suspicious values;
2) EVENTID values that were impossible, missing, or duplicated;
3) Misspelled city and state names;
4) Typographical errors in the article titles;
5) Unusual or logically inconsistent combinations of initiating group characteristics;
6) Large values of the number of participants reported in events that are boycotts or lawsuits;
7) Suspicious values of the target numbers;
8) Suspicious values of target groups;
9) Events that are missing claims;
10) Events with only one claim of 1300 ("Social") or 1338 ("Government policy, n.e.c.");
11) Valences of the first claims of certain events; and
12) Duplicate events.
Each of these 12 issues is addressed in more detail below.
1. Verification and Correction of Extreme Values
We began by examining the frequency distribution for each variable in the dataset. Any impossible values (i.e., those due to miscodes during data entry) were examined and corrected using the original source article. For continuous variables, a set of conventions was established to identify values considered extreme enough to be suspicious. These conventions specified lower boundaries, upper boundaries, and whether to check for missing values. Importantly, we agreed to ignore the continuous measures of injuries and deaths. For categorical variables, any value not present in the codebook was examined and corrected. Each suspicious value was examined by returning to the original New York Times article to verify that the information had been entered accurately. The conventions are detailed in Table 1; a sketch of how they can be applied programmatically follows the table.
Table 1: Cleaning Conventions for Continuous Variables
Variable Name    Lower Limit    Upper Limit        Check Missing
rptyy            1960           1995               Yes
rptmm            1              12                 Yes
rptdd            1              31                 Yes
evyy             1960           1995               No
evmm             1              12                 No
evdd             1              31                 No
stories          0              1000               No
page             0              Over 100           Yes, for 1991-1995
section          None           None               Yes, and impossible values
paragrph         Less than 1    Over 200           Yes
days             None           Over 1000          No
staten           None           Over 50            No
cityn            None           Over 50            No
particex         None           Over 1,000,000     No
smonum           None           Over 20            No
targnum          None           Over 4             No
arrestex         None           Over 1000          No
dollars          None           Over $1,000,000    No
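The flagged values themselves were examined and resolved by hand against the original articles, but the conventions in Table 1 translate directly into simple programmatic flags. The sketch below is a hypothetical Python/pandas illustration; the file name and the subset of variables shown are assumptions.

```python
# Hypothetical illustration of applying a few Table 1 conventions.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

# (lower bound, upper bound, check missing); None means no boundary checked.
conventions = {
    "evyy": (1960, 1995, False),
    "paragrph": (1, 200, True),
    "particex": (None, 1_000_000, False),
    "targnum": (None, 4, False),
}

for var, (lo, hi, check_missing) in conventions.items():
    suspicious = pd.Series(False, index=data.index)
    if lo is not None:
        suspicious |= data[var] < lo
    if hi is not None:
        suspicious |= data[var] > hi
    if check_missing:
        suspicious |= data[var].isna()
    print(f"{var}: {int(suspicious.sum())} values flagged for manual review")
```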
2. EVENTID Values
We examined a total of 29 cases that had erroneous or duplicate values for the unique identification variable, EVENTID. We examined each case where the EVENTID value fell outside the possible range of 6001001 to 9512999. In addition, any unique identification value that appeared more than once was identified and examined individually. If needed, a new EVENTID code was created. The new EVENTID codes followed the sequence of the valid unique IDs, such that each new code was the last valid code for that year and month plus one. For instance, if the final legitimate EVENTID code for August 1960 was 6008099, the new code would be 6008100.
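The replacement IDs were assigned by hand, but the convention can be illustrated with a short hypothetical sketch. The code below assumes, consistent with the example above, that EVENTID encodes the two-digit year and month in its leading digits; the file name is also an assumption.

```python
# Hypothetical sketch of the EVENTID checks and replacement convention.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

out_of_range = (data["EVENTID"] < 6001001) | (data["EVENTID"] > 9512999)
duplicated = data["EVENTID"].duplicated(keep=False)
print("cases needing a new EVENTID:", int((out_of_range | duplicated).sum()))

def next_eventid(year, month, existing_ids):
    """Return the last valid EVENTID for the given year and month, plus one."""
    # Assumption: EVENTIDs encode year and month in their leading digits,
    # e.g. 6008xxx for August 1960, as in the example in the memo.
    prefix = (year % 100) * 100000 + month * 1000
    same_month = [i for i in existing_ids if prefix <= i < prefix + 1000]
    return max(same_month) + 1

# Example from the memo: last valid code for August 1960 is 6008099.
print(next_eventid(1960, 8, [6008001, 6008099]))  # -> 6008100
```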
3. City and State Names (1991-1995 only)
Several typographical errors in the variables CITY1, CITY2, STATE1, STATE2, STATE3, and STATE4 were corrected. By convention, all of the city and state names were recoded to a consistent spelling with all upper-case letters. Note, however, that we did not verify each city name, and any name that appeared plausible was left as-is. Because there are thousands of unique city names, it is possible that a few misspelled cities remain in the data.
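As an illustration only, the recoding convention amounts to something like the following hypothetical pandas sketch; the spelling corrections themselves were made by hand, and the file name is an assumption.

```python
# Hypothetical sketch of the upper-case spelling convention (1991-1995 fields).
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

for var in ["CITY1", "CITY2", "STATE1", "STATE2", "STATE3", "STATE4"]:
    # Recode to a consistent, all upper-case spelling.
    data[var] = data[var].str.strip().str.upper()

# Rarely occurring spellings are candidates for remaining typographical errors.
print(data["CITY1"].value_counts().tail(20))
```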
4. Article titles
In certain cases, we used the New York Times Historical Archive to examine the original coverage for an event (e.g., while investigating potential duplicate cases). If any typographical errors were found in the article titles, these were corrected. These corrections apply only to the limited subsample of cases that we examined during the cleaning process. As a result, the changes were not made systematically throughout the dataset.
5. Characteristics of initiating groups
For 99 events, the characteristics of the first initiating group listed (IGRP1C1, IGRP1C2) appeared suspicious. For example, an initiating group might be described as both "Black" and "Undifferentiated White." To determine whether these values were correct, we returned to the original article source for each of these events and changed the values wherever they were incorrect. Some common examples of the characteristics checked include: 1) Any combination of ethnic characteristics; 2) Any combination of the "Occupational Group/Worker Category" with "Youth," "Students," "Senior Citizens," "Disabled or Retarded Persons," "Residents," or "Veterans;" 3) Any combination of "Youth," "Students," or "Senior Citizens" with "Homeless Persons," "Disabled or Retarded Persons," "Residents," or "Institutionalized Persons;" 4) Any combination of "Institutionalized Persons" with "Occupational Group/Worker Category," "Religious Groups," or "Disabled or Retarded Persons;" and 5) Any combination of "Veterans" with "Residents" or "Disabled or Retarded Persons."
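The sketch below illustrates how two of these rules could be expressed as programmatic flags. The numeric codes are placeholders rather than the actual codebook values, and the actual review was done by hand against the articles.

```python
# Hypothetical sketch of flagging suspicious characteristic combinations.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

# Placeholder code sets -- NOT the actual codebook values.
ETHNIC = {1, 2, 3}         # e.g. Black, Undifferentiated White, ...
WORKER = {10}              # Occupational Group/Worker Category
AGE_STATUS = {20, 21, 22}  # Youth, Students, Senior Citizens

c1, c2 = data["IGRP1C1"], data["IGRP1C2"]

# Rule 1: two ethnic characteristics on the same initiating group.
both_ethnic = c1.isin(ETHNIC) & c2.isin(ETHNIC)

# Rule 2: a worker category combined with an age/status category.
worker_plus_age = (c1.isin(WORKER) & c2.isin(AGE_STATUS)) | (
    c1.isin(AGE_STATUS) & c2.isin(WORKER)
)

print("events flagged for manual review:", int((both_ethnic | worker_plus_age).sum()))
```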
6. Large values of the number of participants recorded in boycotts or lawsuits
Before cleaning, there were 963 events that took the form of a boycott and 3,517 that took the form of a lawsuit. We checked events for which the primary form (FORM1) was coded as either a lawsuit or a boycott and the number of participants (PARTICEX) was coded as ten thousand or above. This filter turned up forty-one such events. Of these, only six were coded incorrectly. In five of the six cases, the event was incorrectly coded as either a lawsuit or boycott, while in only one case was the number of participants incorrect.
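A hypothetical sketch of this filter is given below; the FORM1 codes for boycotts and lawsuits are placeholders, since the actual codebook values are not reproduced in this memo.

```python
# Hypothetical sketch of the boycott/lawsuit participant filter.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

BOYCOTT, LAWSUIT = 40, 41  # placeholder FORM1 codes, not the actual ones

flagged = data[data["FORM1"].isin([BOYCOTT, LAWSUIT]) & (data["PARTICEX"] >= 10000)]
print(len(flagged), "events flagged for manual verification")  # memo reports 41
```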
7. Suspicious values of target number
All target-related variables were checked for logical inconsistencies. Seven hundred eighty-three events specified that a target was identifiable (TARGD) but had missing values for the number of targets (TARGNUM). Reviewing the article sources for these events, we were able to ascertain the number of targets for all of them. If we determined that there was no target in the event, both TARGNUM and TARGD were given missing values. Among other inconsistencies, there were six events in which a specific target group was coded (TARG1 or TARG2) but TARGD contained a missing value. In these cases, we replaced TARGD with a value of "1". After these missing values were replaced, there remained three events in which specific target groups were identified (TARG1 or TARG2) but the number of targets was missing (TARGNUM). In addition, there were forty-nine events in which only one target group was identified (TARG1) even though the number of targets was coded as two or higher (TARGNUM). Finally, there were fifty-four cases in which the number of targets was coded as one or zero while two target groups (TARG1 and TARG2) were identified. All of these remaining inconsistencies were again hand-checked and remedied.
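These consistency checks can be summarized in the following illustrative sketch, which assumes that TARGD equals 1 when a target is identifiable; the flagged cases were resolved by hand.

```python
# Hypothetical sketch of the logical-consistency checks on target variables.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

# Target identifiable but number of targets missing.
check1 = (data["TARGD"] == 1) & data["TARGNUM"].isna()

# Specific target group coded but TARGD missing.
check2 = (data["TARG1"].notna() | data["TARG2"].notna()) & data["TARGD"].isna()

# Only one target group coded but TARGNUM of two or more.
check3 = data["TARG1"].notna() & data["TARG2"].isna() & (data["TARGNUM"] >= 2)

# Two target groups coded but TARGNUM of one or zero.
check4 = data["TARG1"].notna() & data["TARG2"].notna() & (data["TARGNUM"] <= 1)

for i, check in enumerate([check1, check2, check3, check4], start=1):
    print(f"check {i}: {int(check.sum())} events flagged for hand-checking")
```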
8. Suspicious values of target groups
There were 103 events in which an ethnic target was identified (ERTARG1 or ERTARG2) but the corresponding target group variable was coded as a non-ethnic group (TARG1 or TARG2). Seventy-seven of these events were given a value of "7" (ethnic target) for either TARG1 or TARG2, while the remaining twenty-six had legitimately non-ethnic targets, so the incorrectly applied codes for ERTARG1 and ERTARG2 were given missing values. With regard to racial riots, one hundred seventy events contained non-missing values for ERTARG1 and ERTARG2 and at least one ethnic group code among the initiating group variables (IGRP1C1, IGRP1C2, IGRP2C1, IGRP2C2). We applied the following standard: if an initiating group is identifiable for an event that has two different ethnic/racial targets, the article mentioned the group that began the racial conflict/riot. This was true for most of these one hundred seventy events; only fourteen were found not to have an identifiable initiating group.
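An illustrative sketch of the initial flag is shown below; the value 7 for an ethnic target is taken from this memo, while the file name and missing-data conventions are assumptions.

```python
# Hypothetical sketch of the ethnic-target consistency flag.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

ETHNIC_TARGET = 7  # TARG1/TARG2 value for an ethnic target, per the memo

# Ethnic target recorded but neither target-group variable coded as ethnic.
flagged = (data["ERTARG1"].notna() | data["ERTARG2"].notna()) & ~(
    (data["TARG1"] == ETHNIC_TARGET) | (data["TARG2"] == ETHNIC_TARGET)
)
print(int(flagged.sum()), "events flagged for manual review")  # memo reports 103
```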
9. Events that are missing claims
Seven hundred fifty-nine (759) events were missing claims and forms (i.e., no CLAIM1, CLAIM2, CLAIM3, CLAIM4, FORM1, FORM2, FORM3, or FORM4). For each of these events, we returned to the original article source and assigned a value to CLAIM1. Two hundred ninety (290) of these events were racial and ethnic conflicts with no clear initiating or target group. For these events, we created a new claim code, "Ethnic/Racial Conflict, Melees, Riots, Confrontations" (code 2517). This claim code is also used to revise claims in the subsequent checks of events with first claim codes 1300 (1 event) and 1338 (2 events), and of events checked for valences (151 events; see parts 10 and 11 below). One hundred two (102) of these events could not be given a suitable claim code and were thus given the code 1400 ("Other"). Most of these events have more specific Policy Agenda Project (PAP) codes that can be used to identify the nature of the events. All other events were given suitable claim codes.
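The initial screen for events with no claims or forms amounts to the following hypothetical sketch (the file name is an assumption); the claim assignments themselves were made by hand from the articles.

```python
# Hypothetical sketch of the missing-claims screen.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

claim_form_vars = ["CLAIM1", "CLAIM2", "CLAIM3", "CLAIM4",
                   "FORM1", "FORM2", "FORM3", "FORM4"]
missing_all = data[claim_form_vars].isna().all(axis=1)
print(int(missing_all.sum()), "events with no claims or forms")  # memo reports 759
```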
10. Events with only one claim with code 1300 or 1338
Two hundred sixty-one (261) events had only one claim (CLAIM1), coded 1300 ("Social"), which we determined to be too vague to be useful. To make the claim codes more specific, we returned to the original article source for each of these events and assigned a more specific claim code. Sixteen (16) of these events were determined to be non-events and were therefore dropped from the dataset. Additionally, seven hundred twenty-two (722) events had only one claim, coded 1338 ("Government policy, n.e.c."), which we similarly decided was too vague to be useful. Again, to make the claim codes more specific, we returned to the original article source for each of these events and assigned a more specific claim code when appropriate. The valences of these events were also checked and changed accordingly.
11. Valences of the first claims of certain events
For claim codes 1331, 1342, and 2517, we determined that many of the events with these claims do not have a clear valence (i.e., a favoring or opposing side or viewpoint). Thus, for all events with these claims, we assigned valences of 3. For claims with "anti" in the title, we returned to the original article source for each of the events to check that the valences were correctly coded, and we assigned new valences where they were incorrect. Note that only the valences for the first claim (VAL1 for CLAIM1) were checked. The claim codes checked were: all in the 700 and 2500 series, plus 200, 300, 500, 610, 1006, 1109, 1337, 1518, 1520, 1609, 1611, 1612, 1705, 1708, 1709, 1712, 1803, 1805, 1808, 1809, 1903, 1905, 1907, 1912, 2002, 2004, 2006, 2102, 2200, and 2600.
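A hypothetical sketch of the valence recode and the selection of claims for manual checking follows; the expansion of the 700 and 2500 series into explicit numeric ranges is an assumption about the codebook, and the file name is assumed.

```python
# Hypothetical sketch of the valence recode and manual-check selection.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

# Claims determined to have no clear valence receive a VAL1 of 3.
NO_CLEAR_VALENCE = [1331, 1342, 2517]
data.loc[data["CLAIM1"].isin(NO_CLEAR_VALENCE), "VAL1"] = 3

# Claim codes checked by hand for valence. The 700 and 2500 series are
# rendered here as 700-799 and 2500-2599, which is an assumption.
checked_claims = list(range(700, 800)) + list(range(2500, 2600)) + [
    200, 300, 500, 610, 1006, 1109, 1337, 1518, 1520, 1609, 1611, 1612,
    1705, 1708, 1709, 1712, 1803, 1805, 1808, 1809, 1903, 1905, 1907,
    1912, 2002, 2004, 2006, 2102, 2200, 2600,
]
print(int(data["CLAIM1"].isin(checked_claims).sum()),
      "events selected for manual valence checks")
```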
12. Checking Potentially Duplicate Events
One issue with this data collection effort was the possibility of duplicate events, that is, events that were already in the dataset but were reported on or referred to in later articles and then coded again. We took this issue very seriously and attempted to remove any duplicate events. The methodology for identifying and flagging duplicate cases was developed through an iterative process in which a random sample of events was drawn and each member of the cleaning team independently assessed which cases were duplicates. Our central goal was to develop a methodology that struck a reasonable trade-off between efficiency and thoroughness. We then compared our approaches, agreed on a preliminary set of decision rules, and drew a new random sample of fresh cases. The team repeated this process until our agreement was at or above 90%. It took an average of approximately 15 minutes to make a reasonably solid judgment on each case from start to finish. The cases varied considerably in complexity: some took as few as 2 or 3 minutes, while more complicated cases could take up to 30 minutes.
Methodology in Detail
To facilitate comparability across the cleaning team, the process of looking for duplicate events was broken down into two discrete steps: 1) locating and skimming the original event coverage from the New York Times; and 2) using that information, coupled with the event's coding, to identify duplicate cases. Each coder was given a merged version of the 1960-1995 data and an Excel file with a set of variables intended to keep a record of the process. The Excel file included:
- Event ID: The original ID code for each of the events tagged for "follow-up" (FOLLOW_UP). FOLLOW_UP is a variable that coders used to indicate that the coded event may have occurred in a previous month. For example, if an article in 1986 reported on a protest event that happened in 1975, we coded it but would indicate that this event could already be in the dataset and that we would need to "follow up" to make sure it was not a duplicate.
- Comments: A mandatory comment field that briefly describes our assessment of whether the event appears elsewhere in the data and why.
- Flag for Double: A dummy variable used to flag the non-unique cases that are in the data. If the event appears elsewhere in the data, it is coded as 1. If the results from our search suggest the event is unique, this variable is coded as 0. The original entry for a repeated event was given a value of 0.
- Double ID: If the event is flagged for multiple entries, a new unique ID was constructed to cross-reference duplicate cases. If the event appears to be unique, this field is left blank. The duplicate ID values are a text field containing "D-" followed by the event ID of the original case. For example, if the data contain event IDs "12345" and "6789", and "6789" is a duplicate of "12345", both would be given the value "D-12345" in this field.
Fields for ‘Flag for Double’ and ‘Double ID’ were also created in the 1960-1995 data, with respective names of ‘FLAG_FOLLOWUP’ and ‘D_EVENTID’. Our coding conventions were constructed so that all of the duplicate cases can be deleted by eliminating all the cases that have a FLAG_FOLLOWUP value of 1.
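In practice this amounts to a single filter, as in the hypothetical sketch below (the file name is assumed; the variable names are those given above).

```python
# Hypothetical sketch of removing duplicate records using the flag fields.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

# Duplicate entries carry FLAG_FOLLOWUP == 1; the original entry for a
# repeated event keeps a value of 0, so dropping flagged cases removes only
# the redundant records. D_EVENTID cross-references each duplicate with the
# EVENTID of its original (e.g. 'D-12345').
deduplicated = data[data["FLAG_FOLLOWUP"] != 1]
print(len(data) - len(deduplicated), "duplicate records removed")
```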
A) Strategy for collecting articles
For each case where FOLLOW_UP was equal to 1, we went to the New York Times Historical Archive and searched for the original coverage of the event in question. We used the ‘title’ variable to locate the articles. We downloaded and briefly read each article to familiarize ourselves with the cases.
B) Strategy for tracking down previous entries
Once we were adequately familiar with an event, we noted the event’s city and state, since the geographic information was rarely missing. When appropriate we also examined the year, month, and day that an event took place, though these variables often had appreciable levels of missing data.
We next compared the event's location against a complete list of cities in the data to see whether there were any other mentions of that city, including variants arising from typographical errors during data entry, anywhere in the entire 1960-1995 period. If there was only one entry for an event taking place in that city, we assumed the event was unique. If city information was not provided, we used the state abbreviations.
If the city was present at least twice in our master list, which was nearly always the case, we constrained the sample to only those events taking place in that city. We also routinely used combinations of state-level searches, claims codes, and other information to further constrain the sample.
We found that focusing on geographic information and claims codes rather than date information allowed us to catch multiple reports regardless of when they occurred. As a robustness check, we also routinely conducted two or more searches for the same event using different logical queries and constraints to ensure thoroughness.
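As a rough illustration, the search constraints described above could be expressed as the following hypothetical helper; the file name, the city, and the claim value in the example are purely illustrative.

```python
# Hypothetical sketch of constraining the data to plausible duplicate matches.
import pandas as pd

data = pd.read_csv("dca_1960_1995.csv")  # assumed file name

def candidate_duplicates(data, city=None, state=None, claim=None):
    """Constrain the full 1960-1995 data to plausible matches for one event."""
    matches = data
    if city is not None:
        matches = matches[matches["CITY1"].str.upper() == city.upper()]
    elif state is not None:
        # Fall back to the state abbreviation when no city is recorded.
        matches = matches[matches["STATE1"] == state]
    if claim is not None:
        matches = matches[matches["CLAIM1"] == claim]
    return matches

# Example: every event coded in the same city with the same first claim,
# regardless of date (values here are illustrative only).
print(len(candidate_duplicates(data, city="Birmingham", claim=2517)))
```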