Panel Data and Methods

Panel data consist of a cross-section of “individuals” for which there are repeated observations over time. Individuals can be any cross-sectional unit of analysis, such as states, dyads, or survey respondents. Panel data sets are typically dichotomized between long panels, which have many measurement occasions relative to the size of the cross-section, and short panels, which have many individuals in the cross-section relative to the number of repeated measurement occasions, or “waves.” In general, the methods associated with the terms “panel data analysis” or “longitudinal analysis” focus on short panels, while methods under the “time-series cross-section” umbrella focus more on analyzing long panels.

The key advantage of panel data is that such data offer the opportunity to better evaluate causal propositions than strictly cross-sectional data. Whereas cross-sectional data only allow the researcher to observe covariances, panel data further allow the researcher to observe whether a change in an input precedes a change in the outcome. In other words, since panel data consist of the same individuals over time, the analyst can observe a shift in responses as a reaction to an input. One example would be an evaluation of whether a state’s present behavior responds to the prior behavior of its neighbors. Another example might be using a survey panel, such as those often incorporated into the American National Election Studies, to assess how partisan strength influences campaign interest over time.


Issues of Panel Data and Common Remedies

Panel data have a number of features that can pose challenges to analysts. These issues include unit effects, serial correlation, heteroscedasticity, and contemporaneous correlation. Panels also have special problems of missingness. The remainder of this entry focuses on these issues and some remedies for each.

Unit Effects

Whenever individuals’ mean responses differ, unit effects are present in the data. Unit effects can pose serious problems for inference because failing to account for them in some way can produce bias in estimates akin to omitted variable bias. If the mean response varies cross-sectionally via unobserved unit-specific means, but this difference is not modeled (and is thereby left in the error term), then any cross-sectionally varying covariate will correlate with the error term. Such a situation produces endogeneity bias in the model’s coefficients.

In the econometric tradition, two approaches are widely used to handle unit effects. One is the fixed effects model, typically estimated with least squares dummy variables (LSDV). This approach estimates the desired model using OLS, including a dummy variable for each individual, save a reference individual. It has the advantage of being computationally simple and of accounting for a known source of variance in the model specification. However, the individual dummies are perfectly collinear with any variable that varies only cross-sectionally, so LSDV precludes the inclusion of time-invariant variables in a model. An alternative that does allow time-invariant covariates is a GLS model with a compound symmetry covariance structure, known as a random effects model. This model recognizes that repeated observations will covary, so the estimator accounts for this structure by including a term that forces all repeated observations on an individual to correlate at a constant level with each other.

It should be noted that the terms fixed and random effects have multiple meanings. Econometricians typically call an LSDV model a fixed effects model and a GLS-compound symmetry model a random effects model. These terms take a different meaning when analyzing data from the view of hierarchical modeling. Specifically, a fixed effect refers to any model quantity estimated in the fitting of a model (i.e., obtained via least squares or maximum likelihood), while a random effect refers to any parameter that is unique to the individual but can be predicted separately. Mixed effects models contain both fixed and random effects. Confusion can arise because a “random effects model” is a special case of a mixed effects model. For example, the general form of a linear mixed effects model is:

Y_ij = X'_ij β + Z'_ij b_i + e_ij

where Y_ij is the response value for individual i at time j, X_ij is the vector of all covariate values for individual i at time j, β is a vector of fixed effects (coefficients that apply to all individuals), Z_ij is a subset of X_ij that may include any time-varying covariate or a constant, b_i is a vector of random effects for individual i, and e_ij is the error term for individual i at time j. One special case would be a model in which there is only a random intercept, which becomes:

Y_ij = X'_ij β + b_i + e_ij

Each b_i is not estimated directly in the fitting of the model, but can be predicted using empirical Bayes techniques. By decomposing the unexplained variance into b_i and e_ij, which are independent of each other, the model successfully accounts for differences in the mean responses for individuals and the necessary correlation among observations. Hence, the random effects model is seen as a special case of the more general mixed effects model.
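To make the prediction step concrete, the sketch below simulates a random-intercept panel (all data and names hypothetical) and predicts each b_i by empirical Bayes shrinkage: the individual's mean deviation is pulled toward zero by a factor built from the estimated variance components. A real analysis would use a mixed effects routine; this hand-rolled version only illustrates the idea.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical random-intercept panel with no covariates, for simplicity:
# Y_ij = b_i + e_ij, with b_i ~ N(0, 1) and e_ij ~ N(0, 1).
N, T = 500, 4
b_true = rng.normal(0, 1, N)                     # the random intercepts
y = b_true[:, None] + rng.normal(0, 1, (N, T))

ybar = y.mean(axis=1)                            # raw individual means
s2_e = np.mean((y - ybar[:, None]) ** 2) * T / (T - 1)  # within variance
s2_b = max(np.var(ybar, ddof=1) - s2_e / T, 0)          # intercept variance

# Empirical Bayes: shrink each mean deviation toward zero by a factor
# reflecting how much of its variation is signal rather than noise.
shrink = s2_b / (s2_b + s2_e / T)
b_eb = shrink * (ybar - y.mean())

# Shrinkage predicts the b_i better (lower MSE) than the raw means do.
mse_raw = np.mean((ybar - y.mean() - b_true) ** 2)
mse_eb = np.mean((b_eb - b_true) ** 2)
assert mse_eb < mse_raw
```

The shrinkage factor is the same quantity that makes the compound symmetry structure work: with more waves per individual (larger T), the raw means are trusted more and shrunk less.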

There are several practical considerations when deciding how to control for unit effects in a longitudinal model. Again, fixed effects models cannot include time-invariant covariates. Further, when the number of individuals is large, especially relative to the number of waves, estimating a LSDV model is inefficient. An alternative fixed effects estimator to LSDV is the within estimator, wherein the outcome variable and the covariates are all rescaled as deviations from the individual’s mean of each variable. The within estimator avoids the inefficiency of estimating a unique intercept for each individual and yields the same coefficient estimates as LSDV; however, just like LSDV, it cannot accommodate time-invariant covariates.
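The equivalence of the LSDV and within estimators is easy to verify numerically. The sketch below (hypothetical simulated data, NumPy only) fits both on the same panel and confirms that the slope coefficients match:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated panel: N individuals, T waves, one covariate
# correlated with the unobserved unit effects.
N, T = 50, 4
unit = np.repeat(np.arange(N), T)          # individual index for each row
alpha = rng.normal(0, 2, N)[unit]          # unobserved unit effects
x = rng.normal(0, 1, N * T) + 0.5 * alpha  # covariate correlated with alpha
y = 1.5 * x + alpha + rng.normal(0, 1, N * T)

# LSDV: OLS on the covariate plus a dummy for every individual.
D = (unit[:, None] == np.arange(N)[None, :]).astype(float)
X_lsdv = np.column_stack([x, D])
beta_lsdv = np.linalg.lstsq(X_lsdv, y, rcond=None)[0][0]

# Within estimator: demean y and x by individual, then OLS.
def demean(v):
    means = np.bincount(unit, weights=v) / T
    return v - means[unit]

x_w, y_w = demean(x), demean(y)
beta_within = (x_w @ y_w) / (x_w @ x_w)

# The two fixed effects estimators give the same slope coefficient.
assert np.isclose(beta_lsdv, beta_within)
```

The within approach estimates one parameter instead of N + 1, which is why it scales better as the cross-section grows.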

The model of random effects for units allows for time-invariant covariates and avoids the inefficiency problem that can emerge from LSDV. Hence, with especially short panels, or in any model in which the effects of time-invariant covariates are to be estimated, random effects models are probably the most practical option. However, this model assumes that the unit effects are independent of all covariates. If the unit effects are correlated with any of the input variables, then the random effects estimator is biased and inconsistent. The tradeoff, then, is one of robustness versus efficiency: fixed effects estimates remain consistent even when unit effects correlate with the covariates, but they discard all between-individual variation and preclude time-invariant covariates, while random effects estimates exploit both within- and between-individual variation and are more efficient, but only under the assumption of orthogonal unit effects. Whether independent unit effects are a fair assumption can be evaluated with a Hausman test, under which the null hypothesis is that the unit effects are independent, implying that the random effects model is consistent. Rejection of this null hypothesis implies that the random effects model suffers from endogeneity bias. As a final, practical point on random effects models: GLS models such as this require the analyst to specify how the errors of the model are correlated. Because the true correlation between the errors of individuals’ repeated measurements is unknown, feasible GLS must be used. Feasible GLS is estimated with a multistep procedure whereby the residuals of an initial model are used to estimate the correlation of the errors, which is then inserted into a GLS estimator. (The Cochrane-Orcutt estimator for serially correlated errors, for instance, repeats this process iteratively until the estimated error correlation ceases to change.) All of this suggests that analysts must carefully weigh the structure of their data and the goals of their model when choosing how best to handle unit effects.
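For illustration, the hand-rolled sketch below (hypothetical simulated data, NumPy only, a single covariate) computes the within estimate, a feasible GLS random effects estimate via quasi-demeaning, and the resulting Hausman statistic. The simulation deliberately builds unit effects that correlate with the covariate, so the test rejects; production work would rely on an econometrics package rather than this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical panel where unit effects correlate with the covariate,
# so the random effects estimator should fail the Hausman test.
N, T = 200, 5
unit = np.repeat(np.arange(N), T)
alpha = rng.normal(0, 1, N)[unit]
x = rng.normal(0, 1, N * T) + alpha        # covariate correlated with alpha
y = 2.0 * x + alpha + rng.normal(0, 1, N * T)

def group_mean(v):
    return (np.bincount(unit, weights=v) / T)[unit]

# Fixed effects (within) estimate and its sampling variance.
xw, yw = x - group_mean(x), y - group_mean(y)
b_fe = (xw @ yw) / (xw @ xw)
s2_e = np.sum((yw - b_fe * xw) ** 2) / (N * T - N - 1)
v_fe = s2_e / (xw @ xw)

# Random effects via feasible GLS quasi-demeaning: estimate the variance
# components from a between regression, then partially demean the data.
xb, yb = group_mean(x), group_mean(y)
xg, yg = xb[::T], yb[::T]                  # one group mean per individual
b_b = np.cov(xg, yg)[0, 1] / np.var(xg, ddof=1)
s2_b = np.sum((yg - yg.mean() - b_b * (xg - xg.mean())) ** 2) / (N - 2)
s2_a = max(s2_b - s2_e / T, 0)             # unit effect variance
theta = 1 - np.sqrt(s2_e / (s2_e + T * s2_a))
xq, yq = x - theta * xb, y - theta * yb
Xq = np.column_stack([np.ones(N * T), xq])
b_re = np.linalg.lstsq(Xq, yq, rcond=None)[0]
resid = yq - Xq @ b_re
v_re = (resid @ resid / (N * T - 2)) * np.linalg.inv(Xq.T @ Xq)[1, 1]

# Hausman statistic: large values reject the null that unit effects are
# independent of the covariate (chi-squared with one degree of freedom).
H = (b_fe - b_re[1]) ** 2 / (v_fe - v_re)
assert H > 3.84   # rejects at the 5% level, as the simulation intended
```

Note how the bias surfaces: the random effects slope is pulled away from the true value toward the contaminated between-individual relationship, while the within estimate stays near it.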


Serial Correlation, Heteroscedasticity & Contemporaneous Correlation

Serial correlation refers to the tendency for the errors of repeated observations on the same individual to be correlated. In general, this correlation tends to be large and positive, but it diminishes as the time between measurements increases. Serial correlation violates the OLS assumption of uncorrelated errors. The solutions to this problem resemble the fixes for unit effects. One solution is to include a lagged response as a covariate, as this term often accounts for serial correlation and leaves the remaining errors independent. Lagged outcome variables are more commonly used for long panels because the first wave of observations cannot be modeled with this approach, a cost that weighs more heavily when repeated observations are scarce. (It should be noted that while many argue a lagged response most effectively accounts for unit effects and serial correlation, others maintain that endogeneity bias can occur if the lagged term does not filter all of the serial correlation.) Another solution is to estimate a GLS model that includes a covariance pattern matrix, which estimates the covariances between each pair of time waves; the matrix may be unstructured or defined by a clear pattern, such as first-order autoregressive. Lastly, mixed effects models produce correlation matrices based on the variances and covariances of the random effects, so a pattern of correlation also can be captured by random effects. It should be noted that in particularly short panels (for instance, three waves), serial correlation can be hard to account for with any of these methods: A lagged dependent variable costs one wave of data, covariance pattern matrices more complex than that of a simple random effects model can be difficult to estimate, and very short panels do not allow for many random parameters.
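A small simulation makes the problem visible. The sketch below (hypothetical data, NumPy only) generates a panel with first-order autoregressive errors and shows that the residuals from pooled OLS retain the serial correlation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical panel with AR(1) errors: OLS residuals inherit the serial
# correlation, which the lag-1 within-individual autocorrelation reveals.
N, T, rho = 100, 10, 0.7
x = rng.normal(size=(N, T))
e = np.zeros((N, T))
e[:, 0] = rng.normal(size=N)
for t in range(1, T):
    e[:, t] = rho * e[:, t - 1] + rng.normal(0, np.sqrt(1 - rho**2), N)
y = 2.0 * x + e

# Pooled OLS ignoring the error structure (slope itself is still unbiased).
b = np.sum(x * y) / np.sum(x * x)
r = y - b * x                                   # residual matrix, N x T

# Average lag-1 autocorrelation of residuals within individuals.
ac1 = np.mean(r[:, 1:] * r[:, :-1]) / np.mean(r**2)
assert ac1 > 0.5        # serial correlation survives in the residuals
```

The coefficient estimate itself remains unbiased here; what serial correlation ruins is the error structure on which standard errors and GLS weighting depend.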

The methods for covariance patterns and random effects also can be incorporated into the generalized linear model framework, which means that remedies for unit effects and serial correlation also can be used for limited dependent variables (such as counts or binary outcomes). Marginal models, estimated with generalized estimating equations, require the analyst to specify how repeated observations are associated and thereby resemble the covariance pattern GLS model for continuous outcomes. Generalized linear mixed effects models incorporate random effects into the specification and account for the correlation of repeated observations through the random effects.

Heteroscedasticity can be present in panel data if the unmodeled variance in outcomes differs from one individual to the next. This problem can be addressed through a GLS estimator that allows for unique variances among individuals, in addition to the correlation pattern. Contemporaneous correlation arises when individuals have similar errors at particular times, perhaps because some time-dependent factor simultaneously influences all individuals. In the presence of contemporaneous correlation, the sampling variance of linear coefficient estimates is larger than it would be if the Gauss-Markov assumptions held. Regular standard errors do not account for this inefficiency. Panel-corrected standard errors, by contrast, account for the larger error variance, thereby making statistical inference on coefficient estimates less prone to Type I errors.
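The panel-corrected standard error computation can be sketched directly from its sandwich formula. In the hypothetical simulation below (NumPy only), a shared period shock induces contemporaneous correlation, and the corrected standard error for the pooled OLS slope exceeds the naive one:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical TSCS simulation: a common shock each period makes the
# units' errors contemporaneously correlated, and the covariate also
# shares a common component, so naive OLS standard errors are too small.
N, T = 10, 40                          # few units, many waves (a long panel)
e_shock = rng.normal(0, 1, T)          # time shock shared by all units
x_shock = rng.normal(0, 1, T)
x = 0.7 * x_shock + rng.normal(0, 1, (N, T))
e = 0.8 * e_shock + rng.normal(0, 1, (N, T))
y = 1.0 * x + e

# Pooled OLS slope and its naive standard error.
X, Y = x.ravel(), y.ravel()
b = (X @ Y) / (X @ X)
resid = y - b * x                                  # N x T residual matrix
naive_se = np.sqrt(np.sum(resid**2) / (N * T - 1) / (X @ X))

# Panel-corrected standard error: estimate the N x N contemporaneous
# covariance of the errors from the residuals, then plug it into the
# sandwich formula Var(b) = (X'X)^-1 [sum_t x_t' Sigma x_t] (X'X)^-1.
Sigma = resid @ resid.T / T
meat = sum(x[:, t] @ Sigma @ x[:, t] for t in range(T))
pcse = np.sqrt(meat) / (X @ X)
assert pcse > naive_se     # the correction widens confidence intervals here
```

This mirrors the logic of Beck and Katz (1995): the point estimate is untouched, but the standard error is rebuilt from an estimated cross-unit error covariance.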

Missing Data

With panel data, a key concern is that measuring each individual at every wave of observations may not be possible. One reason for this may be censoring that arises from the structure of the study. For example, if different individuals were recruited to participate in a study with staggered start times, but the study ended simultaneously for all, then late joiners would have fewer repeated observations. In this situation, the non-observance of later waves for late joiners would be missing completely at random (MCAR), as qualities of the individuals had no bearing on how often they were observed. In this case, as with any panel data with observations missing completely at random, the data could be analyzed by complete-case analysis (analyzing only cases for which all waves are observed) or available-data analysis (i.e., methods that do not require response vectors of equal length).

Another, more serious cause of missing data in panels is attrition (also called dropout or panel mortality). Individuals who are part of the study may choose not to participate after a few observations, or the researcher may lose track of individuals and be unable to reach them for further study. Dropout specifically refers to the situation in which, once an individual goes unobserved in one wave, the individual is not observed in any future wave. Whereas data that are missing due to censoring are nearly always MCAR, data missing due to dropout may not fit this criterion. If the data are at least missing at random (MAR, meaning that the probability of missingness is conditional only on observable information), then imputation methods can yield unbiased estimates of model quantities. Many studies impute missing values from dropout by assuming all missing values of a response equal the last observed value, a method sometimes called last observation carried forward. This method assumes, however, that the responses would not have changed after dropout, which is usually unrealistic.

A better alternative is multiple random imputation. One technique is to model the probability of missingness and match each missing observation to observed cases with similar probabilities of being missing, randomly drawing several such cases to impute values for the missing observation. A second technique is to model the value of an observation at a particular time using the observed data and impute a value computed from known information about the subject plus a random disturbance. For individuals who later return to the study, observations at later waves, as well as early waves, should be used to impute missing middle values.
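The second technique can be sketched in a few lines. In the hypothetical two-wave simulation below (NumPy only), wave-2 dropout depends on the wave-1 response (missing at random), so the complete-case mean is biased while regression imputation with a random disturbance, repeated M times, recovers the target quantity:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical two-wave panel: regress wave 2 on wave 1 among complete
# cases, then impute each missing wave-2 value as a prediction plus a
# random disturbance, repeated across M completed data sets.
n = 500
w1 = rng.normal(0, 1, n)
w2 = 0.8 * w1 + rng.normal(0, 0.6, n)
p_miss = 1 / (1 + np.exp(-w1))               # MAR: dropout depends on wave 1
miss = rng.random(n) < p_miss

coef = np.polyfit(w1[~miss], w2[~miss], 1)   # imputation model, complete cases
sd = np.std(w2[~miss] - np.polyval(coef, w1[~miss]))

M = 20
means = []
for _ in range(M):                           # M completed data sets
    w2_imp = w2.copy()
    w2_imp[miss] = np.polyval(coef, w1[miss]) + rng.normal(0, sd, miss.sum())
    means.append(w2_imp.mean())
mi_mean = np.mean(means)                     # combined wave-2 mean estimate

# Complete-case analysis is biased under MAR dropout; imputation is not.
assert abs(w2[~miss].mean() - w2.mean()) > 0.2
assert abs(mi_mean - w2.mean()) < 0.15
```

The random disturbance matters: imputing the bare prediction would understate the variability of the completed data, which is why the draws are repeated M times and combined.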

As a final consideration for attrition, a researcher may choose to refresh the sample by adding new observations toward the end of the study, these new observations being called a refreshment sample. Though this strategy does not directly remedy the problem of uneven panels, it does prevent the sample size from shrinking too much when constructing an overall response profile.

Refreshment samples also allow the researcher to diagnose the severity of panel effects. This can be done by comparing variable means from the refreshment sample to the means of those still in the panel at a given wave to see how dropout is influencing the makeup of the sample. Further, the very process of being part of a panel study may influence an individual’s responses over time, a process called “panel conditioning.” Refreshment samples allow for the possibility of adjusting for panel effects through techniques such as “fractional pooling” or “two-stage auxiliary instrumental variables.”

James E. Monogan III

See also: Generalized Least Squares, General Linear Models, Multiple Random Imputation, Regression, Time Series Analysis, Time-Series Cross-Section

Further Readings

Allison, P. D. (1994). Using Panel Data to Estimate the Effects of Events. Sociological Methods & Research, 23, 174-199.

Bartels, L. M. (1999). Panel Effects in the American National Election Studies. Political Analysis, 8, 1-20.

Beck, N. & Katz, J. N. (1995). What to Do (and Not to Do) with Time-Series Cross-Section Data. American Political Science Review, 89, 634-647.

Cameron, A.C. & Trivedi, P. K. (2005). Section V of Microeconometrics: Methods and Applications. New York: Cambridge University Press.

Fitzmaurice, G., Laird, N., & Ware, J. (2004). Applied Longitudinal Analysis. Hoboken, NJ: Wiley.

Stimson, J. A. (1985). Regression in Time and Space: A Statistical Essay. American Journal of Political Science, 29, 914-947.
