University of North Texas

Geog 5190: Advanced Quantitative Techniques

Important Issues in Data Screening

We treat this as introductory material, but you may want to come back to it periodically throughout the course (and your data analysis career!), since some of the issues highlighted here will come up again.

Accuracy of the Data File

With small data files, proofreading may help in confirming the file’s accuracy. However, with many real-world files, proofreading is simply not an option.

With large files:

  • Apply univariate descriptive statistics to see if the characteristics of the file appear to be in order (see the sketch after this list)
  • Are all variables within the range they should be in (e.g. any ages of 250 years? any weights of 5,000 pounds?)
  • Are the mean and standard deviation of each variable plausible?
  • If you graph the data, does the distribution appear reasonable given what you already know about the dataset?
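
A minimal sketch of this kind of screening in Python, assuming the file has been read into a pandas DataFrame and using hypothetical file and column names (age, weight_lb):

    import pandas as pd

    # Hypothetical data file and column names; substitute your own
    df = pd.read_csv("survey.csv")

    # Univariate descriptives: count, mean, std, min, max, quartiles
    print(df.describe())

    # Range checks on variables with known plausible bounds
    print("implausible ages:", (df["age"] > 120).sum())
    print("implausible weights:", (df["weight_lb"] > 1000).sum())

    # Histograms give a quick view of each distribution (needs matplotlib)
    df.hist(bins=30)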

Honest Correlations

Inflated Correlations It is possible for high correlations to be meaningless if the variables themselves overlap to some degree (e.g. soil moisture and rainfall).

Deflated Correlations A falsely small correlation between two variables may occur if the range of one of the variables is artificially constrained (e.g. measurement limits that are too restrictive, such as relating income to years of education when education is only measured over the K-12 range).
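
To see the restricted-range effect concretely, here is a small, purely illustrative simulation of two related variables; truncating the range of x noticeably deflates the observed correlation:

    import numpy as np

    rng = np.random.default_rng(0)

    # Two positively related variables (simulated, for illustration only)
    x = rng.normal(size=5000)
    y = 0.7 * x + rng.normal(scale=0.7, size=5000)

    # Correlation over the full range of x
    print("full range r:", np.corrcoef(x, y)[0, 1])

    # Correlation when x is artificially restricted to a narrow band
    keep = (x > -0.5) & (x < 0.5)
    print("restricted range r:", np.corrcoef(x[keep], y[keep])[0, 1])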

Missing Data

Missing data may come from a variety of sources:

  • Equipment breakdowns
  • Insufficiently sensitive equipment
  • Incomplete survey responses
  • Abrupt respondent relocation/death on a long-term tracking survey

How can we deal with this kind of problem?

  1. Deleting cases or variables: consider deleting if only a few cases or variables are seriously affected (e.g. 2 respondent deaths out of 1000 respondents in the study)
  2. Estimating missing data: use when missing values are known to fall in a given range. There are three possible methods of estimation here (see the sketch after this list):
     a. Use prior knowledge: requires intimate knowledge of the variable on the part of the researcher.
     b. Use the variable’s overall mean: a conservative estimate, since it does not change the mean of the distribution as a whole one way or the other. However, it does reduce the calculated variance, which can be important depending on the total number of cases considered.
     c. Use regression: estimate the missing value from a regression employing all known values. Cases with complete data generate the regression equation, which is then used to calculate the estimate.
  3. Complete the analysis with and without the missing data (a “sensitivity analysis”): substitute estimates for the missing data in the first run. If the results of the two analyses are very similar, the missing data have little impact on your results.
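
As a rough sketch of options 1, 2b, and 2c (one of several reasonable ways to do this in Python; the file and column names are hypothetical):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("survey.csv")

    # Option 1: drop the few cases that are missing most of their values
    df = df.dropna(thresh=int(0.5 * df.shape[1]))

    # Option 2b: mean substitution (conservative, but shrinks the variance)
    df["income"] = df["income"].fillna(df["income"].mean())

    # Option 2c: regression estimate from the cases with complete data
    predictors = ["income", "years_educ"]      # assumed complete at this point
    known = df[df["age"].notna()]
    model = LinearRegression().fit(known[predictors], known["age"])
    missing = df["age"].isna()
    df.loc[missing, "age"] = model.predict(df.loc[missing, predictors])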

Outliers

Outliers are cases with such extreme values that they distort the overall statistics.

Example: consider a regression calculation where most of the data points fall close to what looks like the regression line, but a small number of points fall well off it. The calculated regression line can be pulled toward those far-off points, so that it no longer fits the majority of points that carry the main trend.

It is important to check for the presence of outliers.

Four reasons for outliers:

  1. Incorrect data entry
  2. Programming error: missing value codes (e.g. 9999) are treated as real data (see the sketch after this list)
  3. The outlier is a member of a population other than the one you intended for study
  4. The population that is being studied does not follow a normal distribution, so the outliers are not actually outliers
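
Reason 2 is easy to guard against; assuming pandas and a missing-value code of 9999 (the actual code will differ by dataset), convert the code to a true missing value before computing any statistics:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("survey.csv")     # hypothetical data file

    # Treat the missing-value code as missing, not as a real observation
    df = df.replace(9999, np.nan)

    # Means and standard deviations now ignore the coded values
    print(df.mean(numeric_only=True))
    print(df.std(numeric_only=True))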

Detecting Outliers It is important to distinguish between univariate and multivariate outliers and to handle them separately.

  • Univariate outliers: extreme values in one variable
  • Multivariate outliers: may involve an extreme value on one or more variables, but may also be an odd combination of values that individually fall within the normal ranges of their variables

Example: 16 years is not an unreasonable age for a human being, and $100,000 is certainly a possible annual income. However, a 16-year-old making $100,000 a year would be a very unusual combination.

Strategy for Dealing with Outliers Look for and deal with univariate outliers in each variable first, then look for multivariate outliers. One procedure for identifying multivariate outliers is computation of the Mahalanobis distance for each case:

The distance of the case from the centroid of the remaining cases, where the centroid is the point created by the means of all the variables.
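
A minimal sketch of this computation, assuming the screening variables sit in a pandas DataFrame with hypothetical column names and using the common chi-square cutoff for flagging cases (in practice the centroid of the full sample is typically used):

    import numpy as np
    import pandas as pd
    from scipy import stats

    X = pd.read_csv("survey.csv")[["age", "income", "years_educ"]].dropna()

    # Centroid (vector of variable means) and inverse covariance matrix
    centroid = X.mean().to_numpy()
    inv_cov = np.linalg.inv(np.cov(X.to_numpy(), rowvar=False))

    # Squared Mahalanobis distance of each case from the centroid
    diff = X.to_numpy() - centroid
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

    # Common screening rule: flag cases beyond the chi-square critical value
    # at p < .001, with degrees of freedom equal to the number of variables
    cutoff = stats.chi2.ppf(0.999, df=X.shape[1])
    print(X[d2 > cutoff])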

Describing Outliers With the multivariate outliers identified, you need to discover what makes them outliers. If there are only a few, examine them individually to see what makes each one unusual. If there are many, examine them as a group (perhaps the outliers are similar to one another).

Reducing the Influence of Outliers A few steps can help ensure that outliers do not distort the results of your analysis.

  1. Check the data for the individual cases to determine if they are accurately entered in the data file.
  2. If the check reveals the data are accurate, check to see if one variable is responsible for most of the outliers. If so, can you eliminate that single variable from the study? This is a good option if the variable is highly correlated with one or more of the other variables included in the study, so little information is lost by dropping it.

If these checks do not eliminate your outlier problem, you must decide whether or not the outliers are properly part of the population you intended to sample. If they are not, then you can delete them from the study with no harm to the generalizability of your results.

If the outlier stays in your study even after all of this, your options include:

  • Transform the variable, such as by taking its logarithm or its inverse (see the sketch after this list)
  • Manually reduce the deviancy of an individual case (better have a good reason, and document what you did)
  • Leave the outlier in the study (but be aware of the impact of the outlier on the study results)
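
As an illustrative sketch of the first option, assuming a positively skewed variable named income in a pandas DataFrame (file and column names are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("survey.csv")     # hypothetical data file

    # Logarithm pulls in a long right tail; log1p tolerates zeros
    df["log_income"] = np.log1p(df["income"])

    # The inverse is a stronger correction for severe skew
    # (the +1 avoids dividing by zero)
    df["inv_income"] = 1.0 / (df["income"] + 1.0)

    # Compare skewness before and after transforming
    print(df[["income", "log_income", "inv_income"]].skew())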

Normality and Linearity

Normality: screening continuous variables for normality is an important early step in almost every multivariate analysis. To do this, check skewness (lop-sidedness) and kurtosis (peakedness).
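
A quick numerical check of both, using scipy on a hypothetical variable:

    import pandas as pd
    from scipy import stats

    x = pd.read_csv("survey.csv")["income"].dropna()   # hypothetical variable

    # Skewness near 0 and excess kurtosis near 0 are consistent with normality
    print("skewness:", stats.skew(x))
    print("kurtosis:", stats.kurtosis(x))   # Fisher definition: normal = 0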

Linearity: check for linear relationships among study variables. You cannot use Pearson’s r (univariate or multivariate) if linearity is not present. Basic check: look at scatterplots of the relationships between various variable pairs.
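
One way to run the basic check, assuming pandas and matplotlib are available and using hypothetical variable names:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("survey.csv")[["age", "income", "years_educ"]]

    # Scatterplots of every variable pair; watch for curved or fan-shaped patterns
    pd.plotting.scatter_matrix(df, figsize=(8, 8))
    plt.show()

    # Pearson correlations are only meaningful where the scatterplots look linear
    print(df.corr())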

Multicollinearity and Singularity

Multicollinearity: variables are highly correlated (e.g. 0.90 and over)

Singularity: one variable is actually an exact combination of two or more of the other variables (e.g. a total formed by summing other variables in the data file).
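
A simple screen for both problems, sketched under the assumption that the candidate variables are in a pandas DataFrame with hypothetical names:

    import numpy as np
    import pandas as pd

    X = pd.read_csv("survey.csv")[["age", "income", "years_educ"]].dropna()
    corr = X.corr()

    # Multicollinearity screen: any off-diagonal correlation of 0.90 or above
    off_diag = corr.where(~np.eye(len(corr), dtype=bool))
    print("largest off-diagonal r:", off_diag.abs().max().max())

    # Singularity screen: a correlation-matrix determinant near zero
    print("determinant:", np.linalg.det(corr))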