Data Entry

Slide 1

The goal for this lecture on data preparation is to summarize the steps that researchers perform in transforming their data from its collected form into its computerized form.

Slide 2

As is often the case, I need to begin with some basic terminology. A file is a complete data set. A record is a summary of the responses by one individual or of the data concerning one object. A field is an individual variable within a record, such as questionnaire number, street, city, or response to question one.

Slide 3

The end product of the data preparation process is called the data matrix. If you’re familiar with spreadsheets such as Quattro or Excel, then what you have created on many occasions is a data matrix. Notice for data matrices in marketing that each record is a row and the different fields, which correspond to the different variables in a data set, are represented by the columns.
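
To make the row-and-column idea concrete, here is a minimal sketch in Python using the pandas library; the field names (respondent_id, city, q1) are hypothetical and only for illustration.

    # A minimal sketch of a data matrix: each row is a record (one respondent),
    # each column is a field (one variable). Built with pandas for illustration.
    import pandas as pd

    data_matrix = pd.DataFrame(
        [
            {"respondent_id": 1, "city": "Austin", "q1": 3},
            {"respondent_id": 2, "city": "Dallas", "q1": 5},
            {"respondent_id": 3, "city": "Houston", "q1": 4},
        ]
    )
    print(data_matrix)   # three records (rows), three fields (columns)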

Slide 4

Simply defined, data entry is the process of transferring data from sources like questionnaires into a computer database. Often that data is numeric, in which case the subsequent analysis is statistical in nature. Other times the data is in alpha or word form, as in content analyses, and the subsequent analysis is done with specialized programs that do word counts and the like.
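
For the word-form case, a content analysis typically begins with simple word counts. Here is a minimal sketch in Python; the sample responses are invented for illustration.

    # Minimal sketch of a word count for content analysis of open-ended responses.
    from collections import Counter

    responses = [
        "service was slow but friendly",
        "friendly staff but slow checkout",
    ]
    counts = Counter(word for text in responses for word in text.lower().split())
    print(counts.most_common(3))   # e.g. [('slow', 2), ('friendly', 2), ...]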

Slide 5

I’ll now discuss the following five steps in data preparation: validation, editing, coding, data entry, and machine cleaning of data.

Slide 6

In the data collection stage, researchers check that interviews were conducted as specified. They ensure all respondents were qualified to participate in the study. If personal interviews were conducted, interviewers must be monitored for professional behavior and appearance. Researchers should check that interviews were conducted in the proper environment; for in-office or in-home interviews, the researcher must ensure that they were not conducted at a local tavern! Finally, the validation process includes checking that all appropriate questions were asked and that interviewers didn’t skip certain questions for inappropriate reasons.

Slide 7

The next stage in the data preparation process is the editing stage. This differs by type of interview. For personal interviews, the researcher should check for (1) omissions, which are improperly unanswered questions, (2) ambiguities, which are vague responses to open-ended questions, and (3) inconsistencies, which are incompatible answers to different questions. (For example, the answer to one question indicates childlessness, but the answer to another question indicates the respondent is a parent of several children.) The researcher also should check for proper skip patterns: answers to questions that should have been skipped should be deleted, and efforts should be made to acquire answers to questions that should have been answered. Finally, the researcher should check that answers were recorded properly, especially to open-ended questions. As it’s difficult for personal interviewers to record extensive answers, some completed interview forms may contain only one- or two-word answers, when it’s obvious respondents provided much more detail.
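
The skip-pattern check in particular lends itself to a simple rule. Here is a minimal sketch in Python, assuming a hypothetical layout in which question 6 should be blank whenever question 5 was answered 'no'.

    # Minimal sketch of a skip-pattern edit: if q5 == "no", q6 should have been
    # skipped, so any recorded q6 answer is deleted (set to None).
    records = [
        {"id": 1, "q5": "no", "q6": 4},    # violates the skip pattern
        {"id": 2, "q5": "yes", "q6": 2},   # legitimate answer
    ]
    for record in records:
        if record["q5"] == "no" and record["q6"] is not None:
            record["q6"] = None            # delete the answer that should not exist
    print(records)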

Slide 8

In editing self-administered questionnaires, researchers should check that all key questions were answered. For example, if an important aspect of the data analysis phase is to compare the responses of males versus females, but a respondent failed to indicate his/her sex, then that questionnaire is useless. Researchers also should check that respondents understood the instructions and took the questionnaire seriously, which isn’t always an easy call. I’ve conducted surveys in which respondents have circled the identical answer on a 7-point scale all the way down a page. Were those answers truly reflective of that respondent’s opinions, or was that respondent merely providing a meaningless response? I assumed the answers reflected the respondent’s true opinions. Finally, researchers should check that no pages are missing, especially for lengthy questionnaires, and that the questionnaire was returned before the cutoff date. Things that happen in the world can influence people’s responses, so responses to questionnaires returned after a cutoff date for submission may differ because of environmental changes.
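
Detecting the "identical answer all the way down the page" pattern, often called straight-lining, can be automated. Here is a minimal sketch in Python, assuming a hypothetical battery of ten 7-point scale items stored as a list per respondent.

    # Minimal sketch of a straight-lining check on a battery of 7-point scale items.
    def is_straight_lined(scale_answers):
        """True if every item received the same response."""
        return len(set(scale_answers)) == 1

    print(is_straight_lined([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]))  # True: flag for review
    print(is_straight_lined([4, 5, 3, 4, 6, 2, 4, 5, 3, 4]))  # False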

Slide 9

The editing process is subjective. How does a researcher know that a respondent took the task seriously? How does a researcher know what was meant by an ambiguous response? There’s no right or wrong way to cope with such editing problems, but there are some general solutions. If the respondent is not anonymous, then he or she should be re-contacted in an effort to resolve response ambiguities or inconsistencies. Another, less preferred, alternative is to discard the questionnaire. Given the cost per completed questionnaire, discarding a questionnaire is expensive, so researchers tend to avoid it. If the questionnaire seemingly was taken seriously, but some responses are problematic and the respondent is anonymous, then a third approach is to use only the good items. Although this practice has data implications, the implications are beyond the scope of this course.

Slide 10

The third stage in the data preparation process is coding, which is the process of grouping and assigning numeric codes to the various question responses. This process is easier when dealing with close-ended questions because they’re pre-coded.

Slide 11

Here’s a reminder of what I mean by a pre-coded questionnaire. There are numbers adjacent to each of the answers that respondents might provide, and those numbers are entered into the computer database.
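
In software terms, a pre-coded item is just a fixed lookup from each printed answer to its number. Here is a minimal sketch in Python with hypothetical answer wording.

    # Minimal sketch of a pre-coded close-ended item: each printed answer
    # already has a numeric code, so entry is a simple lookup.
    q1_codes = {"yes": 1, "no": 2, "don't know": 3}

    raw_answer = "no"
    entered_code = q1_codes[raw_answer]   # the 2 goes into the database
    print(entered_code)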

Slide 12

Exploratory research may require one or more open-ended questions. The way researchers deal with answers to those questions is as follows:

(1)  Generate a complete list of all reasonable responses to each question.

(2)  Have multiple judges develop response categories for each question that are mutually exclusive and exhaustive. Developing such categories is a subjective process, so each judge is likely to develop a unique set. Differences between those sets must be resolved by the judges and researcher(s) because only one set can be used for coding.

(3)  After the judges, coders, and researcher(s) agree on the consolidated response list, a numeric code is assigned to each response.

(4)  Have the coders return to the completed questionnaires, read each response, and assign one or more numeric codes that correspond to the consolidated response list.

This process is involved and complex, which is why I recommend that for non-exploratory survey research you take the time to develop close-ended questionnaire items, because coding open-ended questions is far more difficult than creating good close-ended questions.
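
Once the consolidated response list exists, step (4) becomes fairly mechanical: each verbatim answer is matched to one or more category codes. Here is a minimal sketch in Python; the categories and the keyword matching are invented for illustration, and real coders would read each full answer rather than match keywords.

    # Minimal sketch of coding open-ended answers against a consolidated list.
    # A single answer may receive more than one code.
    category_codes = {"price": 1, "schedule": 2, "service": 3}

    def code_answer(answer):
        """Return every category code whose keyword appears in the answer."""
        text = answer.lower()
        return [code for keyword, code in category_codes.items() if keyword in text]

    print(code_answer("Good price and a convenient schedule"))  # [1, 2]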

Slide 13

Here’s an example of a portion of a code book for a travel study. The questions are relatively straightforward, so the codes are relatively simple.

Slide 14

Stage four of the data preparation process is data entry, in which validated, edited, and coded questionnaires are given to the person who will enter the data into the computer database. It’s far more accurate and efficient to go directly from questionnaires to data entry and storage, as opposed to transcribing the data onto coding sheets and then entering the content of those sheets into the database. That intermediate step can only introduce additional error.

Slide 15

If specialized forms are used, such as mark sense forms, then it’s possible to avoid the data entry operator and to scan responses directly into a computer database. Alternatively, optical scanning of questionnaire responses—similar to the way the post office scans the address you’ve written on an envelope and routes your letter to the appropriate post office—is possible. Regardless, questionnaire data ultimately must be transferred to a computer database.

Slide 16

The data entry process can be either intelligent or dumb. In intelligent data entry, the data entry device or another connected device checks the entered data for internal logic; Excel, Quattro, and SPSS, in contrast, rely on dumb data entry. If intelligent data entry relies on software, then the software can be prepared to recognize inappropriate responses. For example, if the response to a question can be only ‘1’, ‘2’, or ‘3’, and the data entry person types a ‘4’, then the computer will not accept that keystroke as a valid response. Dumb data entry, because it lacks error-checking protocols, may necessitate subsequent data cleaning.
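
To illustrate the ‘1’, ‘2’, or ‘3’ example, here is a minimal sketch in Python of how entry-time checking might work; it is not how any particular data entry package implements it.

    # Minimal sketch of intelligent data entry: keep prompting until the
    # keystroke is one of the responses the question allows.
    def enter_response(prompt, valid_codes=("1", "2", "3")):
        while True:
            keystroke = input(prompt)
            if keystroke in valid_codes:
                return int(keystroke)
            print("Invalid response; enter 1, 2, or 3.")

    # q1 = enter_response("Q1: ")   # a '4' would be rejected and re-prompted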

Slide 17

With machine cleaning of data, software such as SPSS can be programmed to identify and suggest fixes for logical errors. For example, it would be illogical for a respondent to indicate in the answer to one question that he’s single and never been married, but in the answer to another question indicate a spouse’s preferences. SPSS can be programmed either to delete or merely flag inconsistent answers (allowing researchers to decide the appropriate fix on a case-by-case basis). In addition, software such as SPSS can generate marginal reports, which are tables of response frequencies for each question. Using SPSS in this way makes it easy to identify invalid keystrokes and improper skip patterns. For example, marginal reports indicate the number of responses to each question; if there are seemingly too many valid responses to a legitimately skippable question, then the researcher would be alerted to a skip-pattern problem.
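
Because a marginal report is just a frequency table per question, it is easy to reproduce outside SPSS. Here is a minimal sketch in Python using pandas, with two hypothetical question columns; the 9 in q1 stands in for an invalid keystroke that a marginal report would expose.

    # Minimal sketch of a marginal report: a frequency count of responses
    # to each question, including how many are missing.
    import pandas as pd

    df = pd.DataFrame({"q1": [1, 2, 2, 3, 9], "q2": [1, 1, None, 2, 2]})
    for question in df.columns:
        print(question)
        print(df[question].value_counts(dropna=False).sort_index())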

Slide 18

This slide shows the logic that would be programmed into a set of machine-cleaning instructions for SPSS. For example, there are only eight valid city codes, the respondent’s ID can’t exceed 9999, and neither negative numbers nor alpha characters are valid.
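
The same rules can be expressed outside SPSS as well. Here is a minimal sketch in Python that assumes the rules named on the slide: city codes 1 through 8, respondent IDs no greater than 9999, and no negative or alpha entries.

    # Minimal sketch of machine-cleaning rules: flag records whose fields
    # fall outside the documented valid ranges.
    def clean_errors(record):
        errors = []
        if not (isinstance(record["city"], int) and 1 <= record["city"] <= 8):
            errors.append("invalid city code")
        if not (isinstance(record["id"], int) and 0 <= record["id"] <= 9999):
            errors.append("invalid respondent id")
        return errors

    print(clean_errors({"id": 12345, "city": 9}))  # both rules violated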

Slide 19

Those are the five steps in the data preparation process. Now, in finishing this lecture, I’ll briefly discuss recoding data and ways to cope with missing data.

Slide 20

Recoding data means using computers to convert the original codes used for the raw data into codes that are more suitable for subsequent analysis. Perhaps reverse coding is needed. In that case, the researcher needs to identify each question (probably an attitude question) that is worded ‘in reverse’; for a 7-point scale, each such response is subtracted from 8. There are other ways to recode data. Although many marketing relationships are non-linear, many frequently used statistical analyses assume linear relationships; hence, it may be necessary to transform the data into a linear form. It may also be necessary to recode data for complex socioeconomic variables, such as social class, which are composed of several components (level of education, income, occupational status, and other indicators).
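
Reverse coding is a one-line transformation. Here is a minimal sketch in Python; the item names are hypothetical.

    # Minimal sketch of reverse coding: for a 7-point scale, a reversed item's
    # response r becomes 8 - r, so 1 <-> 7, 2 <-> 6, and so on.
    reverse_worded_items = ["att3", "att7"]          # hypothetical item names

    record = {"att1": 6, "att3": 2, "att7": 5}
    for item in reverse_worded_items:
        record[item] = 8 - record[item]
    print(record)   # {'att1': 6, 'att3': 6, 'att7': 3}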

Slide 21

Another type of data recoding entails reducing or collapsing the number of response categories. Here’s an example from a Likert scale. The original scale has five points: strongly agree, agree, neither agree nor disagree, disagree, and strongly disagree. The bottom left-hand side of the table in the slide shows the percentage of respondents who gave each answer. It’s possible to collapse this scale from five points to three points by combining the ‘strongly agree’ and ‘agree’ answers and combining the ‘disagree’ and ‘strongly disagree’ answers. Such collapsing of variables may be necessary for cross-tabulation analysis.
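
Collapsing a 5-point Likert item into three categories is another simple recode. Here is a minimal sketch in Python, assuming codes 1 through 5 run from ‘strongly agree’ to ‘strongly disagree’.

    # Minimal sketch of collapsing a 5-point Likert item (1 = strongly agree ...
    # 5 = strongly disagree) into 3 categories: agree, neutral, disagree.
    collapse = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

    original_responses = [1, 2, 2, 3, 5, 4]
    collapsed = [collapse[r] for r in original_responses]
    print(collapsed)   # [1, 1, 1, 2, 3, 3]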

Slide 22 (No Audio)

Slide 23

Although the complexities of missing data are beyond the scope of this course, let me at least alert you to a couple of issues and suggest ways to handle missing data. Missing data is often not randomly distributed. For example, refusal to answer an income question is related to both gender and education level.

Slide 24

Missing data is certainly related to the type of question; the more personal and potentially threatening the question, the more likely a non-response. Here’s an example of the relative frequencies of non-response to different items. Questions about sex and income are the most likely to be unanswered. Questions about age, occupation, and marital status are less likely to be unanswered.

Slide 25

How should missing responses be handled? The possibilities include:

·  Leave the response blank and allow the computer to record it as a missing response.

·  If data for any question is missing, then all the data for that person (or record) could be deleted.

·  Pairwise deletion also is possible. If the correlation between two variables is of interest, and either of those variables has missing data, then the data for that person (or record) is excluded from that particular calculation but retained for calculations involving other variables.

·  The mean response—either for the entire sample or the subgroup from which the respondent was drawn—could be substituted for the missing value. This method assumes that the overall or group mean is the best guess for that person’s response had he or she opted to answer the question.

·  An imputed response could be calculated and substituted for the missing value; the imputed value is estimated from the respondent’s answers to the questions that are most highly related, in a statistical fashion, to the unanswered question.

Each of these approaches has pluses and minuses; many have statistical implications that are beyond the scope of this course. If you are required to conduct a marketing research study, I recommend that you leave missing data blank in your database.
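
For completeness, here is a minimal sketch in Python, using pandas and an invented toy data set, of two of the approaches listed above: deleting whole records with any missing field (listwise deletion) and substituting the sample mean. As noted, simply leaving the blanks as missing values is my recommendation for your own project.

    # Minimal sketch of two missing-data treatments on a toy data set.
    import pandas as pd

    df = pd.DataFrame({"income": [40, None, 55, 62], "age": [25, 31, None, 44]})

    listwise = df.dropna()                # drop any record with a missing field
    mean_filled = df.fillna(df.mean())    # substitute the column (sample) mean

    print(listwise)
    print(mean_filled)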
