Regression in GRETL Using Panel Data

I. Introduction

Typically we have three types of data sets which we use in economics:

(1) Time series – This is the most common form of data that we use, and it is quite easily accessible. You can find time series data in the Taiwan Statistical Databook, central bank websites and publications, the Economic Report of the President, and the publications of the Bureau of Labor Statistics, the Census Bureau, the Asian Development Bank and the Directorate-General of Budget, Accounting and Statistics (DGBAS), as well as at websites like economagic.com. Time series regressions must face the formidable problems of autocorrelation and structural change.

(2) Cross Section – This is data usually observed over geographic or demographic groups. For example, we can observe the unemployment rate for each of the 50 US states plus Washington, DC. This would give us 51 observations on a single variable, unemployment. We can then find cross sectional data on other variables which we think are related to the unemployment rate, such as the suicide rate. Cross sectional data is usually found in publications like the Statistical Abstract of the United States. The ADB has data on each of the countries in the Asia-Pacific region. The OECD has data on the countries in Europe, along with the US and Japan. Many state governments keep very good statistics on each of the counties in the state. A regression that uses such cross section data is called a cross sectional regression. Cross sectional regressions usually suffer from the problem of heteroskedasticity. Moreover, they really hold only for a single moment in time, so there is always the lingering question of whether they can adequately represent the unchanging structure we are researching.

(3) Panel Data – This type combines the first two types. Here we have a cross section, but we observe the cross section over time. If the same people, states or counties sampled in the cross section are re-sampled at a different time, we call this a longitudinal data set, which is a very valuable type of panel data set. Longitudinal data sets are very common in medical and biostatistical studies. Panel data sets are becoming more and more popular because the widespread use of computers makes it easy to organize and produce such data.

This lecture discusses how to read panel data sets into GRETL and then illustrates how we can run regressions on panel data using GRETL.

II. Organizing Panel Data

The best way to explain panel data organization is to use a simple example. Suppose that we have a country with five regions, listed as A, B, C, D, and E.

We can think of the map above as depicting a country with five states. Now, suppose that we travel to each state and collect data on two variables, X and Y. We suspect that Y is determined by X and therefore we assert that there is a stable structure which we can capture using the regression

$$Y = \alpha + \beta X + \varepsilon$$

The next year, we again sample each state to get data on X and Y.

We therefore have two years of data on X and Y for each of the states A, B, C, D, and E. This means we have 10 observations on Y and X (i.e., 5 cross sectional units * 2 time periods).

If we put all the data together and make no distinction between cross section and time series, we can of course run a regression over all the data using ordinary least squares. This is called a pooled OLS regression. This type of regression is the easiest to run, but it is also subject to many types of errors. We could simply write the data in observation form like the following (the data are hypothetical).

Pooled Data Organization

 k    Xk    Yk
 1     7     1
 2     7     1
 3    12     2
 4    15     3
 5    21     3
 6    10     2
 7    14     3
 8    18     4
 9    22     4
10    25     5

Pooled OLS is often used as a rough and ready means of analyzing the data. It is a simple and quick benchmark to which more sophisticated regressions can be compared.
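As a rough check on what a pooled OLS benchmark involves, the sketch below fits the line Y = a + bX to the ten pooled observations with the textbook least-squares formulas. It is plain Python rather than GRETL, and the names simple_ols, alpha_hat and beta_hat are illustrative choices, not part of GRETL.

```python
# Pooled data from the table above: ten (X, Y) pairs with no state or year labels.
X = [7, 7, 12, 15, 21, 10, 14, 18, 22, 25]
Y = [1, 1, 2, 3, 3, 2, 3, 4, 4, 5]


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross products and squares in deviation-from-mean form.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


alpha_hat, beta_hat = simple_ols(X, Y)
print(f"pooled OLS: intercept = {alpha_hat:.4f}, slope = {beta_hat:.4f}")
```

Because the state and year labels are ignored, every observation is treated as if it came from the same state in the same year.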

If we make distinctions between time and cross sectional parts of the data, which is more sophisticated and more informative, there are two ways we can organize this data into tables.

We must be careful how we organize the data so that GRETL can accept our formulation and can help us compute our estimates.

(1) Stacked Cross Sections

State i   Time t   Xit   Yit
A         1          7     1
B         1          7     1
C         1         12     2
D         1         15     3
E         1         21     3
------
A         2         10     2
B         2         14     3
C         2         18     4
D         2         22     4
E         2         25     5

The organization above is called “stacked cross sections” since the blocks of cross sectional data are stacked on top of one another over time. That is, we have the cross section for time period 1 and then, below that, the cross section for time period 2, and so on.

Our regression can now be written as

$$Y_{it} = \alpha + \beta X_{it} + \varepsilon_{it}$$

Note that any $Y_{it}$ or $X_{it}$ can be identified from the table above. For example, $Y_{E2} = 5$ and $X_{C1} = 12$.

(2) Stacked Time Series

State i   Time t   Xit   Yit
A         1          7     1
A         2         10     2
------
B         1          7     1
B         2         14     3
------
C         1         12     2
C         2         18     4
------
D         1         15     3
D         2         22     4
------
E         1         21     3
E         2         25     5

The table above consists of five time series (each one having only two observations) stacked on top of each other.
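The two layouts contain exactly the same observations and differ only in row order. A minimal Python sketch, assuming the hypothetical data are held in a dictionary keyed by (state, period), makes this concrete. The names panel, stacked_cross_sections and stacked_time_series are illustrative only.

```python
# Hypothetical panel from the tables above, keyed by (state, period) -> (X, Y).
panel = {
    ("A", 1): (7, 1),  ("A", 2): (10, 2),
    ("B", 1): (7, 1),  ("B", 2): (14, 3),
    ("C", 1): (12, 2), ("C", 2): (18, 4),
    ("D", 1): (15, 3), ("D", 2): (22, 4),
    ("E", 1): (21, 3), ("E", 2): (25, 5),
}


def stacked_cross_sections(data):
    """Rows ordered by period first, then by state within each period."""
    keys = sorted(data, key=lambda key: (key[1], key[0]))
    return [(i, t, *data[(i, t)]) for i, t in keys]


def stacked_time_series(data):
    """Rows ordered by state first, then by period within each state."""
    keys = sorted(data, key=lambda key: (key[0], key[1]))
    return [(i, t, *data[(i, t)]) for i, t in keys]


for row in stacked_cross_sections(panel):
    print(row)
print("------")
for row in stacked_time_series(panel):
    print(row)
```

Sorting by period and then state reproduces table (1), and sorting by state and then period reproduces table (2).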

If we were to use EXCEL to organize our data (perhaps from a website or some other source), we could set up the data file in either of two ways. Consider the two EXCEL files shown in the figure below. Note that the first row is reserved for the names of the variables. For the “left hand side EXCEL file”, rows 2-6 represent the cross section for period t = 1 and rows 7-11 represent the cross section for t = 2. For the “right hand side EXCEL file”, rows 2-3, taken together, form the time series for cross sectional unit A; rows 4-5, taken together, form the time series for cross sectional unit B; and so on.

The EXCEL files shown above should be saved as ___.csv files (comma-separated values files), which is an option in EXCEL. GRETL accepts ___.txt and ___.csv files, as well as some other formats. The ___.csv files appear to be the easiest to use.
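For readers who assemble their data in a script rather than in EXCEL, the sketch below shows what such a comma-separated file looks like for the stacked cross sections layout. It is only an illustration in Python. The column names X and Y and the helper to_csv_text are assumptions, since the actual EXCEL files appear only in the figure.

```python
import csv
import io

# (X, Y) rows in stacked cross sections order: states A-E for t = 1, then for t = 2.
rows = [(7, 1), (7, 1), (12, 2), (15, 3), (21, 3),
        (10, 2), (14, 3), (18, 4), (22, 4), (25, 5)]


def to_csv_text(data_rows, header=("X", "Y")):
    """Return CSV text whose first row holds the variable names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)        # row 1: names of the variables
    writer.writerows(data_rows)    # rows 2-11: the ten observations
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)
```

Writing csv_text to a file with a .csv extension gives a plain-text file with the variable names in the first row, the same structure as the EXCEL export described above.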

Having shown how panel data can be arranged and having prepared EXCEL ___.csv data files to be read into GRETL, we now turn to the reading-in and analysis of panel data sets using GRETL.

III. Panel Data Analysis in GRETL

Suppose that we have a panel data set organized as stacked cross sections (as above) and named panel.csv. We can read this into GRETL using the following commands:

This will be followed by a series of windows, some of which are shown below.

Once the data has been read into GRETL, we need to change the data structure to panel data. This is done by choosing DATA / DATASET STRUCTURE.

A series of dialog boxes appears next, each requiring a choice. We choose “Panel” in the first dialog box and “Stacked cross sections” in the second, since our data uses the “stacked cross sections” organization.

The next dialog box asks how many cross sectional units there are. In our data set we have 5 cross sectional units (i.e., A, B, C, D, and E) over two periods, t = 1, 2. Usually there will be many cross sectional units and several time periods.

We are then asked to confirm the data structure as panel data (stacked cross sections) having five cross sectional units over two time periods.

We can now display the data in the (i,t) form.

Under the column heading Obs, an entry such as 2:1 means i = B and t = 1, and therefore $X_{21} = X_{B1} = 7$ and $Y_{21} = Y_{B1} = 1$. Our data has been read in and is now ready to be analyzed using the subroutines in GRETL.

We can, of course, follow the same steps when our EXCEL files are stacked time series. In this case, we merely need to check the correct choice in the dialog boxes above. The end result will be the same; namely, we will have a panel data set read into GRETL and ready to be used.

It is important to realize that no matter whether your data is organized as stacked cross sections or stacked time series, GRETL always stores and displays the panel data as stacked time series. This is clear from the figure directly above, which shows that our data is displayed as stacked time series even though it was read in as stacked cross sections.
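The bookkeeping behind this step can be imitated in a few lines of Python: given only the row order and the number of cross sectional units, each row's unit and period follow from its position, and the rows can then be re-sorted into stacked time series with labels of the form i:t. This is a sketch of the idea only, not GRETL's internal code, and label_stacked_cross_sections is a hypothetical helper name.

```python
def label_stacked_cross_sections(rows, n_units):
    """Attach 'unit:period' labels to rows stored as stacked cross sections
    and return them re-ordered as stacked time series."""
    if len(rows) % n_units != 0:
        raise ValueError("row count must be a multiple of the number of units")
    labeled = []
    for k, row in enumerate(rows):
        unit = k % n_units + 1      # position within the current cross section
        period = k // n_units + 1   # which cross section block the row is in
        labeled.append((unit, period, row))
    labeled.sort(key=lambda item: (item[0], item[1]))  # unit first, then period
    return [(f"{unit}:{period}", row) for unit, period, row in labeled]


# (X, Y) rows exactly as they appear in the stacked cross sections file.
rows = [(7, 1), (7, 1), (12, 2), (15, 3), (21, 3),
        (10, 2), (14, 3), (18, 4), (22, 4), (25, 5)]

for label, (x, y) in label_stacked_cross_sections(rows, n_units=5):
    print(label, x, y)
```

With five units, row 2 of the file becomes observation 2:1 and row 7 becomes observation 2:2, so the two observations for state B end up next to each other.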

IV. Panel Regression

Panel data allows the researcher to consider more general models than the simple pooled OLS model we discussed earlier. In particular, we can now assume that the constant term for each state (A, B, C, D, and E) differs. We can write this as

$$Y_{it} = \alpha + a_i + \beta X_{it} + \varepsilon_{it}$$

for i = A, B, C, D, and E and for t = 1 and 2. Each $a_i$ is a separate constant associated with a different state. This means that the actual constant term for each state is equal to $\alpha + a_i$. In fact, we can estimate these constant terms, but they are seldom of much importance and their estimated values are difficult to judge because there is so little data being used to estimate them (two observations for each, using the above data set). Instead, we are typically more interested in the slope coefficient, $\beta$.

There are many ways we can estimate the above regression, assuming the $a_i$'s are all constants. GRETL uses a method called “fixed effects”, so named because of the fixed constants assumption. By subtracting each state's own means over time, $\bar{Y}_i$ and $\bar{X}_i$, from $Y_{it}$ and $X_{it}$ in the regression equation, we can eliminate the constant terms and get

$$Y_{it} - \bar{Y}_i = \beta (X_{it} - \bar{X}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)$$

We can now apply OLS to this regression and estimate $\beta$. This is done in the following way in GRETL.
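To see what the demeaning step does with the hypothetical data, the sketch below subtracts each state's own mean from its X and Y values and then applies OLS without a constant. It illustrates the textbook within estimator in plain Python and is not GRETL's own routine. The names within_slope and beta_fe are illustrative.

```python
# Hypothetical data from the tables above: two periods per state.
X = {"A": [7, 10], "B": [7, 14], "C": [12, 18], "D": [15, 22], "E": [21, 25]}
Y = {"A": [1, 2], "B": [1, 3], "C": [2, 4], "D": [3, 4], "E": [3, 5]}


def within_slope(x_by_unit, y_by_unit):
    """OLS slope after subtracting each unit's own time mean from X and Y."""
    numerator = 0.0
    denominator = 0.0
    for unit in x_by_unit:
        xs, ys = x_by_unit[unit], y_by_unit[unit]
        x_bar = sum(xs) / len(xs)
        y_bar = sum(ys) / len(ys)
        for x, y in zip(xs, ys):
            numerator += (x - x_bar) * (y - y_bar)
            denominator += (x - x_bar) ** 2
    # No constant is needed: the demeaned variables have mean zero in every unit.
    return numerator / denominator


beta_fe = within_slope(X, Y)
print(f"fixed effects (within) slope = {beta_fe:.4f}")
```

The slope uses only the variation within each state over time, so any difference in level between states, which is what the separate constants capture, plays no role.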

Choosing the “fixed or random effects” option produces another dialog box.

The output from this regression can be printed out as

Note that no constant is estimated in the regression. If the estimates of the individual constants are desired, these can be accessed by first saving them (the per-unit constants) and then printing them out.

A print out of these constants (there are five) can then be easily accomplished.
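The per-unit constants can also be recovered by hand from the slope estimate, using the textbook relation that each state's constant equals its mean of Y minus the slope times its mean of X. The Python sketch below applies this to the hypothetical data. It assumes the saved GRETL series follows the same definition, which is not verified here, and unit_constants is a hypothetical helper name.

```python
# Hypothetical data from the tables above: two periods per state.
X = {"A": [7, 10], "B": [7, 14], "C": [12, 18], "D": [15, 22], "E": [21, 25]}
Y = {"A": [1, 2], "B": [1, 3], "C": [2, 4], "D": [3, 4], "E": [3, 5]}


def unit_constants(x_by_unit, y_by_unit):
    """Return (slope, {unit: constant}) from the within (fixed effects) estimator."""
    means = {}
    numerator = 0.0
    denominator = 0.0
    for unit in x_by_unit:
        xs, ys = x_by_unit[unit], y_by_unit[unit]
        x_bar = sum(xs) / len(xs)
        y_bar = sum(ys) / len(ys)
        means[unit] = (x_bar, y_bar)
        numerator += sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        denominator += sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    # Each unit's constant makes its fitted line pass through its own means.
    constants = {unit: y_bar - slope * x_bar for unit, (x_bar, y_bar) in means.items()}
    return slope, constants


slope, constants = unit_constants(X, Y)
for unit, value in constants.items():
    print(f"state {unit}: constant = {value:.4f}")
```

Each constant here rests on only two observations, which is why such estimates are hard to judge.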

The random effects model is just as easy to estimate. One simply chooses the random effects option in the dialog box for panel models.

The random effects model assumes that the $a_i$'s are all random variables. The Y's and X's are appropriately transformed and both $\alpha$ and $\beta$ are estimated as follows:
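One common textbook form of this transformation is quasi-demeaning: a fraction theta of each state's mean is subtracted from Y and X, where theta = 1 - sqrt(s_e^2 / (s_e^2 + T s_a^2)), with s_e^2 the variance of the error, s_a^2 the variance of the $a_i$'s and T the number of periods. The Python sketch below takes theta as a given input and does not reproduce how GRETL estimates the variances, so it illustrates the idea and is not a replica of GRETL's output. The function name quasi_demeaned_ols is illustrative.

```python
# Hypothetical data from the tables above: two periods per state.
X = {"A": [7, 10], "B": [7, 14], "C": [12, 18], "D": [15, 22], "E": [21, 25]}
Y = {"A": [1, 2], "B": [1, 3], "C": [2, 4], "D": [3, 4], "E": [3, 5]}


def quasi_demeaned_ols(x_by_unit, y_by_unit, theta):
    """OLS on Y - theta*mean_i(Y) and X - theta*mean_i(X); theta is supplied."""
    x_star, y_star = [], []
    for unit in x_by_unit:
        xs, ys = x_by_unit[unit], y_by_unit[unit]
        x_bar = sum(xs) / len(xs)
        y_bar = sum(ys) / len(ys)
        x_star += [x - theta * x_bar for x in xs]
        y_star += [y - theta * y_bar for y in ys]
    n = len(x_star)
    mx = sum(x_star) / n
    my = sum(y_star) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(x_star, y_star))
    s_xx = sum((x - mx) ** 2 for x in x_star)
    beta = s_xy / s_xx
    # The transformed constant is alpha*(1 - theta); alpha is not identified at theta = 1.
    alpha = (my - beta * mx) / (1 - theta) if theta < 1 else None
    return alpha, beta


for theta in (0.0, 0.5, 1.0):
    alpha, beta = quasi_demeaned_ols(X, Y, theta)
    print(f"theta = {theta}: alpha = {alpha}, beta = {beta:.4f}")
```

Setting theta = 0 reproduces pooled OLS and theta = 1 reproduces the fixed effects slope, so the random effects estimate can be read as a compromise between the two.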