Cross Section Data Analysis

Practical Guide for Cross Section Data Analysis Using EViews

I Gusti Ngurah Agung

Graduate School of Management

Faculty of Economics, University of Indonesia

Ph.D. in Biostatistics and

MSc. in Mathematical Statistics from

The University of North Carolina at Chapel Hill

1

Data Transformation

The most basic analysis of any cross-section data set is the study of the differences between the objects under study, either individuals or groups of individuals, which are known to have different or distinct characteristics.

For this reason, classification analysis is a very important part of cross-section data analysis, and the most basic analysis is the one based on a zero-one indicator. This chapter therefore presents how to generate dummy variables, since they are widely used in statistical models based on cross-section data sets.

By selecting Quick/Generate Series…, a dummy variable as well as a categorical variable can easily be generated by using the block-copy-paste method to insert the equation specifications presented below into the dialog or window.

1.1 Generating Dummy Variables

1.1.1 Dummy Variables For A Single Categorical Variable

The dummy variable of the k-th category of a variable V, namely DVk, for k=1,…,K, can be generated using the following equation:

DVk=1*(V=k) or Dk=1*(V=k) (1.1)
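Outside EViews, the logic of equation (1.1) can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_variables` and the sample values of V are hypothetical and do not come from the original text.

```python
def dummy_variables(v, levels):
    """Return {k: [1 if value == k else 0, ...]}, mirroring DVk = 1*(V=k)."""
    return {k: [1 if value == k else 0 for value in v] for k in levels}


# Hypothetical categorical variable V with K = 3 categories.
V = [1, 3, 2, 2, 1, 3]
DV = dummy_variables(V, levels=[1, 2, 3])

for k, column in DV.items():
    print(f"DV{k} =", column)
```

Each observation receives the value 1 in exactly one of the K dummy variables, which is why a model with an intercept has to drop one of them.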

1.1.2 Ordinal Categorical Variables Based On A Numerical Variable

A categorical variable should be generated using the alternative methods presented in Chapter 2, and then the dummy variables can easily be generated using formula (1.1). For example, an ordinal categorical variable with k categories, namely CkX, can be generated from a numerical variable X using the following equation specifications.

C2X = 1 + 1*(X>=a)

C3X = 1 + 1*(X>=a) + 1*(X>=b)

C4X = 1 + 1*(X>=a) + 1*(X>=b) + 1*(X>=c)

C5X = 1 + 1*(X>=a) + 1*(X>=b) + 1*(X>=c) + 1*(X>=d) (1.2)

where a < b < c < d are intentionally selected numbers, chosen by using one of the following alternative statistics.

1). Using nonparametric statistics, such as the quantiles of the variable X, by using the function @quantile(X,q).

For example, a = @quantile(X,0.30) and b = @quantile(X,0.70), to generate C3X.

2). Using parametric statistics, such as the mean and standard deviation of the variable X, which can be computed as m = @mean(X) and sd = @stdev(X).

For example, to generate C4X, a = m - 1.5*sd, b = m, and c = m + 1.5*sd.

3). Alternatively, one may use the Z-score of X, namely ZX, which has m = 0 and sd = 1. For example, to generate C4ZX, a = -1.5*sd, b = 0, and c = +1.5*sd.

4). Using the personal judgment of the researcher, based on the data set available.
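The coding rule in (1.2) with the mean-and-standard-deviation cut points can be sketched in Python as follows. The function name `ordinal_category` and the data are hypothetical. Python's `statistics.stdev` uses the n - 1 divisor; whether that matches the value reported by EViews should be checked rather than assumed.

```python
import statistics


def ordinal_category(x, cut_points):
    """Mirror CkX = 1 + 1*(X>=a) + 1*(X>=b) + ... for increasing cut points."""
    return [1 + sum(value >= c for c in cut_points) for value in x]


# Hypothetical numerical variable X.
X = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 14.0]

m = statistics.mean(X)
sd = statistics.stdev(X)  # sample standard deviation (assumption: n - 1 divisor)
cuts = [m - 1.5 * sd, m, m + 1.5 * sd]  # a < b < c, as used for C4X

C4X = ordinal_category(X, cuts)
print(C4X)
```

With cut points this far from the mean, the extreme categories can be sparse or empty in a small sample, which is one reason the text also offers quantile-based cut points.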

1.1.3 Dummy Variables For Two-Way Tabulation

Based on a two-way tabulation, say an IxJ table of the factors A and B, it is suggested that the dummy variables be presented as dummy-cells, namely DCij, for i=1,…,I and j=1,…,J, by using the following equation, which can easily be extended to an N-way tabulation.

DCij =1*(A=i and B=j) (1.3)

Similarly, based on an IxJxK table of the factors A, B and C, it is suggested to generate the dummy variables using the following equation.

DCijk =1*(A=i and B=j and C=k) (1.4)

1.2 Generating a Cell-Factor (CF)

Since EViews 6 provides the function @Expand(CF), which directly transforms the cell-factor CF into its dummy variables, it is suggested to generate a cell-factor when doing an analysis based on an N-way tabulation. For illustration, consider the following cell-factors, which can easily be extended to IxJ and IxJxK tabulations, as well as to N > 3.

1.2.1 Based on a 2x2-Tabulation of the factors A and B

CF=11*(A=1 and B=1)+12*(A=1 and B=2)

+21*(A=2 and B=1)+22*(A=2 and B=2) (1.5)

1.2.2 Based on a 3x2-Tabulation of the factors A and B

CF=11*(A=1 and B=1)+12*(A=1 and B=2)+21*(A=2 and B=1)

+22*(A=2 and B=2)+31*(A=3 and B=1)+32*(A=3 and B=2) (1.6)

1.2.3 Based on a 2x2x2-Tabulation of the factors A, B and C

CF=111*(A=1 and B=1 and C=1)+112*(A=1 and B=1 and C=2)

+121*(A=1 and B=2 and C=1)+122*(A=1 and B=2 and C=2)

+211*(A=2 and B=1 and C=1)+212*(A=2 and B=1 and C=2)

+221*(A=2 and B=2 and C=1)+222*(A=2 and B=2 and C=2) (1.7)
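The cell-factors in (1.5) to (1.7) all follow one pattern: the levels of the factors are written side by side as the digits of a single code. A minimal Python sketch of that pattern is given below. The function name `cell_factor` and the data are hypothetical, and the sketch assumes every factor is coded with the integers 1 to 9 so that each level occupies one digit.

```python
def cell_factor(*factors):
    """Combine single-digit factor levels into one code, e.g. A=2, B=1 -> 21."""
    codes = []
    for levels in zip(*factors):
        code = 0
        for level in levels:
            if not 1 <= level <= 9:
                raise ValueError("this sketch assumes levels coded 1..9")
            code = 10 * code + level
        codes.append(code)
    return codes


# Hypothetical factors A (3 levels), B (2 levels) and C (2 levels).
A = [1, 1, 2, 2, 3]
B = [1, 2, 1, 2, 2]
C = [2, 1, 1, 2, 1]

CF_AB = cell_factor(A, B)      # two-way tabulation, as in (1.5) and (1.6)
CF_ABC = cell_factor(A, B, C)  # three-way tabulation, as in (1.7)
print(CF_AB, CF_ABC)
```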

2

Single-Factorial Regression Models

This chapter considers a cell-factor CF, which can be a single factor or can be generated from two or more categorical factors; a set of exogenous variables or covariates, namely X1, X2, … , XK, where each Xk, for k=1,…,K, can be a main factor or a two- or three-way interaction; and an endogenous variable Y, which can be numerical, zero-one or ordinal.

As an extension, if a classification factor is generated from an exogenous numerical covariate, then the regression is a piecewise regression model. Each equation specification can be used to conduct the analysis under one of several estimation settings, as follows:

1). For a numerical variable Y, the estimation setting is LS-Least Squares.

2). For a zero-one variable Y, the estimation setting is BINARY Choice.

3). For an ordinal variable Y, the estimation setting is ORDERED Choice, with a small modification of the intercept parameter.

4). For a censored observation Y, the estimation setting is CENSORED.

Note that each equation specification can easily be modified by using transformed endogenous or exogenous numerical variables in order to obtain alternative models, such as polynomial, semi-logarithmic, and trans-log linear or quadratic models, as well as bounded regression models, which use the dependent variable log((Y-L)/(U-Y)), where L and U, respectively, are the subjectively selected lower and upper bounds of the variable Y.
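The bounded transformation log((Y-L)/(U-Y)) and its inverse can be sketched in Python as follows. The function names and the bounds L = 0 and U = 100 are hypothetical choices for illustration. The inverse shows why the transformation is useful: any fitted value on the transformed scale maps back to a value strictly between L and U.

```python
import math


def bounded_transform(y, lower, upper):
    """Return log((y - L) / (U - y)); defined only for L < y < U."""
    if not lower < y < upper:
        raise ValueError("y must lie strictly between the bounds")
    return math.log((y - lower) / (upper - y))


def inverse_bounded_transform(z, lower, upper):
    """Map a value on the transformed scale back into the interval (L, U)."""
    return (lower + upper * math.exp(z)) / (1.0 + math.exp(z))


# Hypothetical bounds for a score measured on a 0-100 scale.
L, U = 0.0, 100.0
z = bounded_transform(80.0, L, U)
print(z, inverse_bounded_transform(z, L, U))
```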

The data analysis can easily be conducted by selecting Quick/Estimate Equation…, and then using the block-copy-paste method to insert each of the following equation specifications into the dialog or window. Finally, with the result on the screen, various corresponding hypotheses can easily be tested using the Wald test, and additional analyses, the residual analysis in particular, can be conducted.

2.1 One-Way ANOVA Models

The main objectives of an ANOVA model are to test various hypotheses on the mean differences between the levels of a single factor or the cells generated by two or more (nominal) categorical factors. In order to write the statistical hypotheses based on each model, a table of the model parameters should be constructed. Refer to the corresponding table presented in the main book!

2.1.1 One-Way ANOVA Model without Intercept “C”

Y @Expand(CF) (2.1)

2.1.2 One-Way ANOVA Model with an Intercept “C”

Y C @Expand(CF,@Dropfirst) (2.2)

Y C @Expand(CF,@Droplast) (2.3)

Y C @Expand(CF,@Drop(*)) (2.4)

where (*) indicates a level of the cell-factor CF: for example, an integer (k) for a single factor, a pair of integers (i,j) for a two-way tabulation, and (i,j,k) for the cell-factor of a three-way tabulation.
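The meaning of the parameters in (2.1) to (2.4) can be illustrated without running a regression. In a least-squares one-way ANOVA model without an intercept, each coefficient equals the sample mean of its cell. With an intercept and one dropped level, the intercept equals the mean of the dropped cell and every other coefficient is a difference from that mean. The Python sketch below computes these quantities directly; the function names and the data are hypothetical.

```python
from collections import defaultdict


def cell_means(y, cf):
    """Coefficients of the no-intercept model: one sample mean per level of CF."""
    groups = defaultdict(list)
    for value, level in zip(y, cf):
        groups[level].append(value)
    return {level: sum(vals) / len(vals) for level, vals in groups.items()}


def anova_with_intercept(y, cf, drop):
    """Intercept and effects when the level `drop` is the reference level."""
    means = cell_means(y, cf)
    intercept = means[drop]
    effects = {k: mean - intercept for k, mean in means.items() if k != drop}
    return intercept, effects


# Hypothetical data with a three-level cell-factor.
Y = [10.0, 12.0, 20.0, 22.0, 30.0, 34.0]
CF = [1, 1, 2, 2, 3, 3]

print(cell_means(Y, CF))               # pattern of (2.1)
print(anova_with_intercept(Y, CF, 1))  # pattern of (2.2), first level dropped
print(anova_with_intercept(Y, CF, 3))  # pattern of (2.3), last level dropped
```

The choice of the dropped level changes only the reference against which the differences are expressed, not the fitted cell means.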

2.1.3 Alternative Methods

1). With the data of the variable Y on the screen, various statistics and tests of hypotheses can be derived, as follows:

1.1 By selecting View/Descriptive Statistics & Tests, Figure 2.1(a) is shown on the screen. It presents eight options, namely the graph with several alternative options, six options for the Descriptive Statistics & Tests, and One-Way Tabulation, for the cross-section data. Each output can easily be obtained. Note that descriptive statistical summaries are a very important part of all evaluation studies. Refer to Chapter 2.

Figure 2.1 The Options for the Data Analysis based on a Single Variable

1.2 In addition, numerical variable(s) can also be inserted as the series for classification. In this case, the “Max # of bins:” setting should be taken into account.

2). With the data of both Y and CF on the screen, specifically if Y is a zero-one or ordinal variable, various statistics and tests of hypotheses can be derived, as follows:

2.1 By selecting View, Figure 2.2 is shown on the screen. Then, by selecting N-Way Tabulation and the output options…OK, the statistical results, specifically the descriptive statistical summary and the Chi-square statistics, are obtained. Refer to Chapter 2.

Figure 2.2 The Options for the Data Analysis based on a Set of Variables

Figure 2.3 The Options for the Data Analysis based on a Pair of Variables

2.2 Specifically, if CF is an ordinal variable, by selecting Covariance Analysis…, the options in Figure 2.3(a) are shown on the screen, and then by selecting the Kendall’s tau method, Figure 2.2(b) is shown on the screen.

2.3 Finally, by selecting the options Kendall’s Score S/Probability |S|=0 …OK, the nonparametric statistical results are obtained for testing the association between a numerical and an ordinal variable; in other words, for testing the effect of an ordinal variable CF on a numerical variable Y. In fact, Kendall’s tau can also be applied to bivariate numerical variables. Note that the Spearman rank-order correlation can also be applied.

2.2 One-Way ANCOVA Models

The main objectives of an ANCOVA model, or a homogeneous regression model, are to test various hypotheses on the adjusted-mean differences between the levels of a single factor or the cells generated by two or more (nominal) categorical factors. In order to write the statistical hypotheses based on each model, a table of the model parameters should be constructed. Refer to the corresponding table presented in the main book! One should then be able to write the equation of the regression within all levels or cells.

In addition, the hypotheses on various types of the effects of the covariate(s) on the dependent variable should be theoretically defined, and then can be tested.

2.2.1 One-Way ANCOVA Model without Intercept or “C”

Y X1 . . . XK @Expand(CF) (2.5)

Note that the list of the variables, specifically the covariates, is presented in exactly the same order as the list in the output. For comparison, even if “Y @Expand(CF) X1 … XK” is used as the equation specification, the output will present the model parameters as C(1), C(2), … in the ordering of the list in the equation specification (2.5).

2.2.2 One-Way ANCOVA Model with an Intercept or “C”

Y X1 . . . XK C @Expand(CF,@Drop(*)) (2.6)

Note that, in general, a covariate Xk can be a main factor or a two- or three-way interaction factor. For example, for the covariates X1 and X2, the following ANCOVA model could be considered. In this case, it is theoretically defined that the effect of X1 (X2) on Y depends on X2 (X1).

Y X1 X2 X1*X2 C @Expand(CF,@Droplast) (2.7a)

For illustration, Table 2.1 presents the parameters of the model in (2.7a) for a CF having four levels.

Table 2.1 The Parameters of the Model in (2.7a) for a CF having four levels

X1 / X2 / X1*X2 / CF=1 / CF=2 / CF=3 / CF=4
C(1) / C(2) / C(3) / C(4)+C(5) / C(4)+C(6) / C(4)+C(7) / C(4)

For comparison, Table 2.2 presents the parameters of the following equation specification for a CF having four levels.

Y X1 X2 X1*X2 C @Expand(CF,@Dropfirst) (2.7b)

Table 2.2 The Parameters of the Model in (2.7b) for a CF having four levels

X1 / X2 / X1*X2 / CF=1 / CF=2 / CF=3 / CF=4
C(1) / C(2) / C(3) / C(4) / C(4)+C(5) / C(4)+C(6) / C(4)+C(7)

Note that in practice, a nonhierarchical ANCOVA model, or an additive model having only the main factors X1 and X2, could be a good-fit model, depending heavily on the data set used.

Furthermore, for the covariates X1, X2, and X3, the hierarchical ANCOVA model would be as follows, where X1, X2 and X3 are defined to have a complete association. The data analysis based on a reduced model can easily be done by using the block-copy-paste method of this equation.

Y X1 X2 X3 X1*X2 X1*X3 X2*X3 X1*X2*X3

C @Expand(CF,@Drop(*)) (2.8)
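A full hierarchical list of main factors and interactions can be enumerated mechanically, which helps avoid omitting or repeating a term when the specification is typed by hand. The Python sketch below only builds the text of a specification in the pattern of (2.8); the function name `hierarchical_terms` is hypothetical, and the result still has to be pasted into EViews.

```python
from itertools import combinations


def hierarchical_terms(covariates):
    """List all main factors, then all two-way, three-way, ... interactions."""
    terms = []
    for order in range(1, len(covariates) + 1):
        for combo in combinations(covariates, order):
            terms.append("*".join(combo))
    return terms


terms = hierarchical_terms(["X1", "X2", "X3"])
spec = "Y " + " ".join(terms) + " C @Expand(CF,@Droplast)"
print(spec)
```

A reduced model is then obtained by deleting terms from this list.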

2.2.3 Suggested Additional Analysis

Given a set of numerical variables consisting of Y and the covariates, it is suggested to conduct the bivariate correlation analysis, namely Covariance Analysis in EViews, using the options presented in Figure 2.3.

Note that the impact of the multicollinearity between the independent variables of any model is unpredictable, and the output can present unexpected parameter estimate(s). Refer to the special notes and comments in Agung (2009, Section 2.14.2).

Finally, for a more advanced data analysis, one may consider conducting the residual analysis, and note the following special illustrations or findings. Refer to various examples in Agung (2009).

Example 2.1 (A Special Illustration). The following three alternative analyses based on a pair of numerical variables X and Y show exactly the same value of the t-statistic.

1). The correlation of (X,Y) by using EViews 6,

2). The OLS regression of Y on X, using the equation specification “Y C X”, and

3). The OLS regression of X on Y, using the equation specification “X C Y”.

These findings indicate that the correlation analysis is sufficient for testing the causal linear effect between X and Y, including their simultaneous causal linear effects, which should be theoretically defined. In other words, the test of hypothesis should not be used to prove that X and Y have causal effects.
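The equality stated in Example 2.1 can be checked numerically. The Python sketch below computes the slope t-statistic of each simple regression and the usual t-statistic of the Pearson correlation, r*sqrt(n-2)/sqrt(1-r^2). The function names and the data are hypothetical, and the formulas are the textbook ones rather than a reproduction of the EViews output.

```python
import math


def _sums(x, y):
    """Sample size, means, and centered sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return n, mx, my, sxx, syy, sxy


def slope_t_statistic(x, y):
    """t-statistic of the slope in the OLS regression of y on x with intercept."""
    n, mx, my, sxx, _, sxy = _sums(x, y)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    standard_error = math.sqrt(sse / (n - 2) / sxx)
    return slope / standard_error


def correlation_t_statistic(x, y):
    """t-statistic for testing a zero Pearson correlation."""
    n, _, _, sxx, syy, sxy = _sums(x, y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical data.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
Y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

t_y_on_x = slope_t_statistic(X, Y)   # specification "Y C X"
t_x_on_y = slope_t_statistic(Y, X)   # specification "X C Y"
t_corr = correlation_t_statistic(X, Y)
print(t_y_on_x, t_x_on_y, t_corr)
```

The three values agree up to rounding error because each reduces to the same expression in the sums of squares and cross-products, which is symmetric in X and Y.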

Example 2.2 (Another Special Illustration). Similarly, the ANCOVA model of Y on X and the ANCOVA model of X on Y, shown below, give the same t-statistic for testing the hypothesis on the effect of the independent numerical variable on the dependent variable, adjusted for the cell-factor CF.

Y X @Expand(CF) or Y C X @Expand(CF,@Drop(*)) (2.9a)

X Y @Expand(CF) or X C Y @Expand(CF,@Drop(*)) (2.9b)

2.3 Heterogeneous Regression Models

The main objectives of a heterogeneous regression model are to test various hypotheses on the slope differences between the levels of a single factor or the cells generated by two or more (nominal) categorical factors. In other words, the aim is to test various hypotheses on the differences of the effect of a covariate or numerical independent variable, adjusted for the other independent variables. In order to write the statistical hypotheses based on each model, a table of the model parameters should be constructed. Refer to the corresponding table presented in the main book! One should then be able to write the equation of the regressions within all levels or cells.

Note that the intercept differences of the regressions should not be taken into consideration, let alone used for testing hypotheses on their differences.

2.3.1 Heterogeneous Regression Model without Intercept

Y @Expand(CF) X1*@Expand(CF) . . . XK*@Expand(CF) (2.10)
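For a single covariate, a model in the pattern of (2.10) expands both the intercept and the slope over all levels of CF, so its least-squares coefficient estimates coincide with those of a separate line fitted within each level. The Python sketch below fits those separate lines; the function names and the data are hypothetical. Only the point estimates are compared here, since the standard errors of a pooled model are based on a common residual variance.

```python
from collections import defaultdict


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def regressions_by_level(y, x, cf):
    """Fit one simple regression of y on x within each level of CF."""
    groups = defaultdict(lambda: ([], []))
    for yi, xi, level in zip(y, x, cf):
        groups[level][0].append(xi)
        groups[level][1].append(yi)
    return {level: fit_line(xs, ys) for level, (xs, ys) in groups.items()}


# Hypothetical data: two levels with clearly different slopes.
CF = [1, 1, 1, 2, 2, 2]
X1 = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
Y = [3.0, 5.0, 7.0, 4.0, 3.0, 2.0]

fits = regressions_by_level(Y, X1, CF)
print(fits)
```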

2.3.2 An Alternative Heterogeneous Regression Model

Y X1 … XK @Expand(CF) X1*@Expand(CF,@Dropfirst) . . .

XK*@Expand(CF,@Dropfirst) (2.11)

2.3.3 Heterogeneous Regressions Using Dummy Variables

It is recognized that heterogeneous regressions may have different sets of independent variables within the defined levels of CF. For this reason, it is suggested to use the dummy variables if and only if the regression models, or their reduced models, within the levels of CF have different sets of exogenous variables. For example, based on the following full model, one may obtain various good-fit reduced models, which are highly dependent on the data sets used. Note that, in general, a covariate Xk can be a main factor or a two- or three-way interaction factor.

Y = (C(10)+C(11)*X1 +C(12)*X2+C(13)*X3+C(14)*X4+C(15)*X5)*D1

+(C(20)+C(21)*X1 +C(22)*X2+C(23)*X3+C(24)*X4+C(25)*X5)*D2

+(C(30)+C(31)*X1 +C(32)*X2+C(33)*X3+C(34)*X4+C(35)*X5)*D3

+(C(40)+C(41)*X1 +C(42)*X2+C(43)*X3+C(44)*X4+C(45)*X5)*D4 (2.12)

Note that this equation specification can easily be modified for a cell-factor CF having any number of levels, as well as for any sets of independent variables, by using the block-copy-paste method. Furthermore, it is suggested that the full model and all its reduced models be saved using the same symbols for the parameters. In other words, an independent variable Xk will have the same symbol C(ik), i = 1,2,3 or 4, in the full and reduced models.
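The pattern of (2.12) is regular enough to be generated as text for any number of levels and covariates. The Python sketch below does so with the C(ik) numbering used in (2.12). The function name `heterogeneous_spec` is hypothetical, and the sketch assumes at most nine levels and nine covariates so that each index i and k stays a single digit.

```python
def heterogeneous_spec(dependent, covariates, n_levels):
    """Build an equation string in the pattern of (2.12) with parameters C(ik)."""
    if n_levels > 9 or len(covariates) > 9:
        raise ValueError("this sketch assumes at most 9 levels and 9 covariates")
    blocks = []
    for i in range(1, n_levels + 1):
        inner = f"C({i}0)" + "".join(
            f"+C({i}{k})*{name}" for k, name in enumerate(covariates, start=1)
        )
        blocks.append(f"({inner})*D{i}")
    return f"{dependent} = " + " + ".join(blocks)


# Hypothetical use: two covariates and a cell-factor with two levels.
print(heterogeneous_spec("Y", ["X1", "X2"], 2))
```

Because the parameter symbol is tied to the level i and the covariate position k, deleting a term to form a reduced model leaves the symbols of the remaining parameters unchanged.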

2.4 Alternative Regression Models

As an extension of the models presented above, several alternative regression models, either linear or nonlinear, can easily be defined based on a set of numerical variables, as presented in Table 2.3. The following remarks correspond to these models.

Table 2.3 Equation Specifications of Alternative Regression Models Based on a Set of Numerical Variables

Dependent variable (for every model): Y, Log(Y), Log(Y-L), Log(U-Y), or Log((Y-L)/(U-Y))

No. / Independent Variables
Single Exogenous Variable X
1 / C X X^2 … X^k
2 / C(1) + C(2)*(X-a)^2*(X-b)
3 / C log(X)
4 / C log(X) log(X)^2 … log(X)^k
Two Exogenous Variables X1 and X2
5 / C X1 X2 X1*X2
6 / C log(X1) log(X2)
7 / C log(X1) log(X2) log(X1)^2 log(X1)*log(X2) log(X2)^2
8 / C log(X1) log(X2) (log(X1) - log(X2))^2
Nonlinear Models
3a1 / = C(1)*X^C(2)
3a2 / = C(1)+C(2)*X^C(3)
6a1 / = C(1)*X1^C(2)*X2^C(3)
6a2 / = C(1)*X1^C(2)*X2^C(3) + C(4)
7a1 / = C(1)*(C(2)*X1^C(3)+(1-C(2))*X2^C(3))^(-r/C(3))
7a2 / = C(1)*(C(2)*X1^C(3)+(1-C(2))*X2^C(3))^(-r/C(3))+C(4)

1). The regression models in Table 2.3 can be viewed as models without the cell-factor as an independent variable, and they can easily be extended to models having any set of exogenous variables. However, one should be aware of the unpredictable impact of the multicollinearity of the independent variables on the parameter estimates. Refer to Section 2.14.2 in Agung (2009). In some cases, an error message may even be obtained, especially for nonlinear models.

2). Note that L and U, respectively, are the lower and upper bounds of Y, which are subjectively selected. On the other hand, a and b are specifically selected numbers which are related to the predicted extreme values of the corresponding independent variable.

3). The nonlinear models 3a1 and 3a2 are related to the trans-log linear model-3, which could be the same as the Cobb-Douglas production function with one input variable. Similarly, the nonlinear models 6a1 and 6a2 are related to the Cobb-Douglas production function with two input variables. Refer to Chapter 10 in Agung (2009a).

4). The nonlinear models 7a1 and 7a2 are related to the CES (Constant Elasticity of Substitution) production function, which can be approximated by using the trans-log quadratic model-7.

5). The trans-log linear model-8 is a special case of model-7.

6). By using the cell-factor CF as an additional independent variable, various ANCOVA and heterogeneous regression models can easily be defined or derived, in the same way as the models previously presented.

7). If the cell-factor is generated based on one or two exogenous numerical variables, then discontinuous regression models, either piecewise or step regressions, would be obtained.

8). By following the equation specification in (2.12), one can easily apply various sets of exogenous variables, as well as different models, within each level of a defined cell-factor.

9). As a further extension of the models, specifically for piecewise or step regression models, a different transformation of Y can be applied within each piece or level of the cell-factor CF, where CF is generated based on one or two numerical exogenous variables. For example, Y_New = log(Y) for CF=1, and Y_New = log((Y-L)/(U-Y)) for CF = 2. Refer to Agung (2009a).

3

Bi-Factorial Regression Models

This chapter considers two categorical factors A and B; a set of covariates, namely X1, X2, … , XK, where each Xk, for k=1,…,K, can be a main factor or a two- or three-way interaction; and an endogenous variable Y, which can be numerical, zero-one or ordinal. In fact, these models can be viewed as one-way regression models. Hence, all notes presented above are valid for these models.

The objective of this chapter is to present alternative equation specifications, which should be considered more complex equations. However, they have advantages for testing hypotheses, since their statistical results present the test statistic, such as the t-statistic or Z-statistic, for the hypotheses based on each selected model. Refer to the main book!

3.1 Two-Way ANOVA Models

The alternative equation specifications are as follows:

3.1.1 The Simplest Two-Way ANOVA Model without Intercept “C”

Y @Expand(A,B) (3.1)

3.1.2 Two-Way ANOVA Model with an Intercept

Y C @Expand(A,B,@Drop(*,*)) (3.2)

where (*,*) indicates a cell generated by the two factors A and B.

3.1.3 Alternative Two-Way ANOVA Models