Econ 301B - Applied Statistics and Econometrics

UNIVERSITY OF OSLO

DEPARTMENT OF ECONOMICS

Exam:

ECON 301B - APPLIED STATISTICS AND ECONOMETRICS

Date of exam: Monday, 9 December 2002

Time for exam: 9 a.m. - 3 p.m.

The problem set covers 7 pages including computer output

Resources allowed: All printed books and private notes as well as calculators.

All questions should be answered.

The grade scale is A, B, C, D, Fail (with A as the best grade).

Comments are given in Arial font after each question.

Some of the views expressed in the comments could have been different. It is the coherence and quality in the argument that matters.

Scientific journals constitute the medium of communication between scientists, and also the memory (storage) of science. The economics of (scientific) journals is interesting. Bergstrom[1] argues that journals owned by private publishers are grossly overpriced, and he recommends several actions to reduce the large profits made by these publishers. Bergstrom provides data to substantiate his case. There are 180 economic journals in his database, of which 16 are published by scholarly societies such as the American Economic Association. These 16 journals are published on a non-profit basis, as opposed to the remaining journals that have private publishers. We shall concentrate on the following variables:

: Library subscription price for the journal per year.

: Number of libraries subscribing to the journal.

: Total number of times papers in the journal were cited in 1998.

: Age of the journal.

: Number of pages in the journal in 1998.

: Binary variable (dummy); 1 if non-profit (scholarly society), 0 otherwise.

It is rare that an article in an economics journal is as explicit as Bergstrom's in its policy recommendations aimed at reducing the profits of economic agents, but Bergstrom clearly has a dual role: as disinterested analyst, and as an academic economist with an economic interest. In his section “What can we do”, Bergstrom suggests: (i) To expand the much cheaper and also generally better non-profit journals owned by professional societies. (ii) To support new electronic journals. (iii) To punish overpriced journals by cancelling library subscriptions, defecting from editorial boards, not sending good papers to these journals, and refusing to referee papers for them.

(a) In Figure 1 in the appendix, price is plotted against number of pages. The circles represent non-profit journals. Comment on the graph.

Price is clearly rising with number of pages for both private and non-profit journals, but apparently more so for private journals. The graph shows that variability in price increases with number of pages. Whether price is linearly related to number of pages is hard to tell. Too much noise in the graph!

(b) Figure 1 does not show a relationship between price and number of pages that agrees well with the classical assumptions behind OLS. Why? Explain from the figure why log price might be close to linear in log number of pages, and why the classical assumptions might be better satisfied on this log-log scale. Use L as a prefix to denote logged variables throughout.

The OLS assumption of homoscedasticity is clearly not met for price versus number of pages, according to Figure 1. The log-transformation is concave: it stretches the lower end and contracts the upper end of the variation. Taking the log of price will counteract the obviously higher variation at higher numbers of pages. Also taking the log of number of pages will hopefully preserve, and perhaps improve on, the possible linear pattern in the graph.
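The variance-stabilising effect described above can be illustrated with a small simulation. This is only a sketch on synthetic data, not the journal data: the multiplicative-error model, the parameter values and the helper names `ols_line` and `spread_ratio` are all assumptions made for the illustration.

```python
import math
import random


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def spread_ratio(x, y):
    """Residual spread in the upper half of x divided by that in the lower half."""
    a, b = ols_line(x, y)
    pairs = sorted((xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2

    def rms(res):
        return math.sqrt(sum(r * r for r in res) / len(res))

    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[half:]]
    return rms(upper) / rms(lower)


# Synthetic data (assumed model): price = 0.5 * pages * exp(noise)
rng = random.Random(0)
pages = [rng.uniform(100, 2000) for _ in range(2000)]
price = [0.5 * s * math.exp(rng.gauss(0, 0.5)) for s in pages]

raw_ratio = spread_ratio(pages, price)
log_ratio = spread_ratio([math.log(s) for s in pages],
                         [math.log(p) for p in price])
print(f"raw scale: {raw_ratio:.2f}   log-log scale: {log_ratio:.2f}")
```

On the raw scale the residual spread is clearly larger for large journals, while on the log-log scale the ratio should be close to one, which is the pattern the comment hopes for in the real data.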

(c) A matrix of pair-wise scatter plots for logged variables is given in Figure 2 for non-profit journals, and in Figure 3 for privately published journals. Regarding LP as the response variable, how does this variable seem to respond to the other variables? You might comment further on the plots, but be brief.

The upper right corners in Figures 2 and 3 show the scatter of versus . The scatter in Figure 2 has a small number of points, and one must therefore guard against over-interpretation. The scatter in Figure 3 shows a fairly nice linear and homoscedastic scatter for the 164 private journals, perhaps with slightly more variability in the lower end. This pattern is not contradicted by the scatter in Figure 2, other than the variability there seems larger at the upper end. Both these possible violations of homoscedasticity might be small sample illusions.

Upper rows in both graphs, from left to right: seems uncorrelated with and nearly so with . The latter is surprising, since many citations add value to the journal. But this effect might be masked by other variables, as we will see from the regressions. is positively correlated with both and . There is thus potential for masking the effect of . Figures 2 and 3 show no reasons not to analyse these data by linear methods (regression) on the log scale for the variables price, age, citations and number of pages.

(d) Consider the regression

where is a stochastic error term, and is the dummy variable defined on page 1. The OLS results for this regression are given in Table 1 in the appendix. Explain what is meant by R-squared and Adj R-squared. What is the interpretation of and , respectively?

R-squared is the fraction of the squared variation in the response variable around its mean that is “explained” by the linear regression within the sample, that is, recovered as variation in the predicted response due to variation in the explanatory variables. Adj R-squared is adjusted for the number of covariates in the regression. When covariates that have no theoretical correlation with the response are added, Adj R-squared will tend to decrease (though not necessarily), while R-squared will increase (or remain unchanged, in the unlikely event that the empirical correlation between the current residual and the new covariate is precisely zero).
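As a quick arithmetic check, the two measures can be recomputed from the sums of squares printed in the Stata output block in the appendix (180 observations, two covariates). The variable names below are chosen for this sketch only.

```python
# Sums of squares copied from the ANOVA block of the appendix output
ss_model, ss_resid, ss_total = 36.8357611, 119.232662, 156.068423
n, k = 180, 2  # observations and covariates (intercept not counted)

# R-squared: share of the total sum of squares recovered by the model
r2 = ss_model / ss_total

# Adjusted R-squared: compares mean squares, so each extra covariate costs a df
adj_r2 = 1 - (ss_resid / (n - k - 1)) / (ss_total / (n - 1))

# Root MSE: square root of the residual mean square
root_mse = (ss_resid / (n - k - 1)) ** 0.5

print(round(r2, 4), round(adj_r2, 4), round(root_mse, 5))
```

The results agree with the printed values R-squared = 0.2360, Adj R-squared = 0.2274 and Root MSE = .82075.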

measures the effect on the expected value of of a journal being non-profit versus private, given the same number of pages. The empirical result is that a non-profit journal is priced at about exp(-1.18) ≈ 0.31 of a private journal with the same number of pages. measures the price elasticity with respect to number of pages. The elasticity in this model is assumed the same for both private and non-profit journals.

(e) A more general model to consider is
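The back-transformation from the log scale can be written out as a short worked equation. The symbols $\beta_D$ (coefficient of the dummy), $D$ (the dummy), $S$ (number of pages) and $P$ (price) are placeholder names assumed for this sketch; only the estimate $-1.18$ is taken from the comment above.

```latex
\begin{align*}
\mathrm{E}[\log P \mid D=1, S] - \mathrm{E}[\log P \mid D=0, S] &= \beta_D, \\
\frac{\text{typical price, non-profit}}{\text{typical price, private}}
  &= e^{\beta_D}, \qquad e^{-1.18} \approx 0.31 .
\end{align*}
```

Read this way, a non-profit journal costs roughly 31% of what a private journal with the same number of pages costs, that is, it is about 69% cheaper.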

Would you interpret differently for this model than for the model in (d)? The OLS results for this model are given in Table 2, where etc., and . Calculate a 95% confidence interval for . What is your point estimate of the elasticity for private journals of median number of pages,? What is the estimated elasticity for a non-profit journal of the same size?

does not have quite the same basic interpretation. The expected difference in log price for two otherwise identical journals, one being non-profit and the other private, is now , and not only . It is surprising that this estimated effect now seems positive: that non-profit journals are more expensive than private. But note the large standard error. The effect is not statistically significant!

The 95% confidence interval is . is the 0.975 quantile in the t-distribution with 171 degrees of freedom.
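The general form of such an interval is sketched below. The symbol $\hat\beta$ stands for whichever coefficient estimate is of interest and $\widehat{\mathrm{SE}}$ for its standard error from Table 2; both are generic placeholders, since no numbers from Table 2 are plugged in here. Only the 171 degrees of freedom come from the text.

```latex
\[
\hat\beta \;\pm\; t_{0.975,\,171}\,\widehat{\mathrm{SE}}(\hat\beta),
\qquad t_{0.975,\,171} \approx 1.97 .
\]
```

With 171 degrees of freedom the t quantile is only slightly larger than the normal quantile 1.96.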

The estimated elasticity is for private journals, and for non-profit journals of median size.

(f) A rationale for introducing the interaction term is that private journals maximize profit, and the more cited a journal is the more valuable it is. Comment on the estimated signs of and . Discuss also the estimated signs for the other coefficients.

Both and are estimated to have negative effect on price. Both effects are non-significant, though. That seems to have a negative effect is surprising. I would expect the private publishers to try to maximize profit, and thus to price much cited journals higher. That seems to have a negative effect makes sense. Relative to the private, the non-profit journals would tend to be relatively cheaper the more valuable (more cited) the journal is – everything else equal.

The sign for is already discussed. The signs for , and must be discussed jointly, say by looking at the elasticity that has been considered. What remains are the signs for and . The first is significantly negative. This is a bit surprising since older journals that have survived might be more valuable than newer journals (everything else equal), and private publishers should then be expected to price them higher. That has a positive sign, which is non-significant, might also be surprising. It is hard to see why non-profit journals get relatively more expensive than private ones the older they are.

(g) A third model is obtained by reducing model (e) to

The results by OLS are given in Table 3. Which of the three models considered so far would you prefer? Discuss and test!

As measured by Adj R-squared, model (g) is the superior of the three. It fits the data much better than model (d), and only slightly worse than model (e) (as measured by R-squared, and equivalently by Root MSE). Model (g) is simpler and slightly easier to interpret than model (e). In model (g), affects price only through the differential effect of citation. It is nice to isolate the effect of the most interesting covariate, and it is nice to get strong significance (and expected sign) for this covariate. This is the case in model (g). This discussion leads to model (g) as the preferred one.

The three models are only partially nested: (g) is obtained from (e) by setting three coefficients to zero, and (d) is obtained from (e) by setting 6 coefficients to zero. But model (d) cannot be obtained from (g) by setting coefficients to zero. Since model (d) obviously is inferior, we only test model (g) versus model (e). The test problem is versus at least one of the coefficients being non-zero. The F-statistic is . This number is compared to the F-distribution with 3 and 171 degrees of freedom, and the conclusion is that there is hardly evidence for claiming one or more of the three coefficients being non-zero ( is not rejected at any meaningful level).
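The F-statistic for comparing nested models has the same form whichever pair is tested. The sketch below uses a hypothetical helper `nested_f` and, since Tables 2 and 3 are not reproduced here, checks the formula against the overall F in the appendix output instead, where the restricted model is the intercept-only model.

```python
def nested_f(rss_restricted, rss_full, q, df_full):
    """F statistic for H0: the q extra coefficients of the full model are zero.

    rss_restricted, rss_full: residual sums of squares of the two models
    q: number of restrictions; df_full: residual df of the full model
    """
    return ((rss_restricted - rss_full) / q) / (rss_full / df_full)


# Check against the appendix output: intercept-only model (RSS = Total SS)
# versus the two-covariate model (RSS = Residual SS, 177 residual df).
f_overall = nested_f(rss_restricted=156.068423, rss_full=119.232662,
                     q=2, df_full=177)
print(round(f_overall, 2))
```

The result matches the printed F(2, 177) = 27.34. For the test of model (g) against model (e) the same function would be called with q = 3 and df_full = 171 and the residual sums of squares from Tables 3 and 2.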

(h) Table 4 gives the variance inflation factors for model (g). What do these numbers tell you? Suggest a change of variables that will reduce the unwanted effects of large inflation factors, but without changing the essence of model (g).

The first two VIFs are terribly high. They tell us that and are strongly linearly related, as they certainly must be. That the p-value for (testing its coefficient being non-zero) is as low as 12% is really impressive in the presence of this strong collinearity. The collinearity could be reduced by replacing by , which is the residual obtained by regressing on by OLS. This would alter the coefficient for and also its SE, but the coefficient for would not be changed. The fit (R-squared, sums of squares) would remain unchanged.
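The residual trick can be illustrated in the simplest possible setting. The sketch below uses synthetic data and only two regressors, a dummy `d` and an interaction `d*x`, in which case the VIF reduces to 1/(1 - r²). The variable names, the data and the two-regressor simplification are assumptions of this sketch; model (g) itself has more covariates.

```python
import random


def mean(v):
    return sum(v) / len(v)


def corr(u, v):
    """Pearson correlation between two equally long sequences."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def vif_two_regressors(u, v):
    """With exactly two regressors, both have VIF = 1 / (1 - r^2)."""
    return 1.0 / (1.0 - corr(u, v) ** 2)


def ols_residuals(y, x):
    """Residuals from the simple OLS regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Synthetic data: 16 journals with d = 1 and 164 with d = 0
rng = random.Random(1)
d = [1.0] * 16 + [0.0] * 164
x = [rng.gauss(6.0, 1.0) for _ in d]          # a logged covariate, far from zero
dx = [di * xi for di, xi in zip(d, x)]        # interaction term

vif_before = vif_two_regressors(d, dx)
dx_resid = ols_residuals(dx, d)               # interaction purged of d
vif_after = vif_two_regressors(d, dx_resid)
print(f"VIF before: {vif_before:.1f}, after: {vif_after:.3f}")
```

Because OLS residuals are orthogonal to the regressor, the residualised interaction is uncorrelated with the dummy and the inflation factor drops to one, while the column space of the design matrix, and hence the fit, is unchanged.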

(i) Returning to Bergstrom’s paper: do you agree that private journals are over-priced? Based on your preferred model, describe the pricing policy and profit generation in private journals.

I base my discussion on model (g). The strongly significant effect of indicates that private publishers price journals relatively higher than society publishers the more valuable the journal is as measured by citations. With non-profit journals as a standard, private journals are overpriced, and more so the more they are cited.

(j) Are economists in academia loyal to their non-profit journals in the sense that university libraries are more prone to subscribe to a journal published by a scholarly society when everything else is equal? To address this question, the following model is considered.

The OLS results for this model are given in Table 5. Discuss the issue raised. Note that the supplier side in the journal market is a mixed bag. Non-profit journals are generally priced according to real production cost, with the hard work of editing and refereeing done on a no-pay basis. These journals are thus priced with little regard to what could have been their market price.

The signs for , , , and do make sense. However, libraries do seem to subscribe less to non-profit journals than privately published ones, everything else equal. This effect is non-significant. It could be due to heavier marketing by private publishers. For private publishers, we do also have a problem with simultaneity. Private publishers are likely to fix their price in response to what they perceive as the demand function. A privately published journal of the same quality as a society journal is thus likely to be priced higher, see point (i). The number of subscribers would then be reduced, and it is thus likely that subscription is higher for non-profit journals than private journals of the same quality.

(k) Inspecting the empirical residuals from model (j), a pattern is noted. The pattern seems to be

This formula is obtained by regression. Several regression models were attempted to find a reasonable model. Explain why this finding indicates heteroscedasticity. How can the formula be used to construct weights for a weighted regression? The results from such a weighted regression are given in Table 6. Discuss the pros and cons of using this particular weighted regression rather than the OLS. Which of the 95% confidence intervals for given in Table 5 and Table 6 respectively will you prefer?

Under homoscedasticity, we should have no relation between any of the covariates and or with . The given relation indicates that is increasing with and decreasing with . Disregarding the effect of non-linearity, one might replace by in the relation, and obtain , and the inverse of this as the weight in weighted regression.
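The general recipe (fit by OLS, model the residual variance, reweight by the inverse of the fitted variance) can be sketched as follows. The data are synthetic, and the log-linear variance function used in step 2 is an assumption chosen because it guarantees positive weights; it is not the formula found in the exam, which is not reproduced here. The helper name `wls_line` is also hypothetical.

```python
import math
import random


def wls_line(x, y, w):
    """Weighted least-squares (intercept, slope) of y on x with weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Synthetic data: true line 1 + 2x, error SD growing with x
rng = random.Random(2)
n = 500
x = [rng.uniform(0, 4) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, math.exp(0.5 * xi - 1)) for xi in x]

# Step 1: ordinary least squares (unit weights) and its residuals
a0, b0 = wls_line(x, y, [1.0] * n)
resid = [yi - (a0 + b0 * xi) for xi, yi in zip(x, y)]

# Step 2: model the variance pattern by regressing log squared residuals on x
g0, g1 = wls_line(x, [math.log(r * r) for r in resid], [1.0] * n)

# Step 3: weights are the inverse of the fitted variance; refit by WLS
weights = [1.0 / math.exp(g0 + g1 * xi) for xi in x]
a1, b1 = wls_line(x, y, weights)
print(f"OLS slope {b0:.3f}, variance slope {g1:.3f}, WLS slope {b1:.3f}")
```

A positive slope in step 2 is the numerical counterpart of the pattern noted in the residuals: the variance is not constant but changes systematically with a covariate.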

The pros: The weighted regression provides a better fit, as measured by R-squared and F. It also reduces standard errors. The cons: the particular weights have been found by “running several regressions”. Such fishing trips might bring home a catch, but not necessarily bring out a pattern in the variance that prevails in repeated sampling – whatever that could be in the particular case. The chosen weights might therefore represent some degree of over-fitting to the data. On balance, however, we have a relatively large sample and a distinct improvement in the fit. I will therefore vote for the weighted regression, and my confidence interval of choice for is

(l) Our data consist of 180 journals in economics. This is pretty much the collection of academic journals in this field that use the English language. This collection is thus not a random sample from some existing population. Explain the statistical meaning of a confidence interval, say that in point (e), and discuss the difficulties involved in this interpretation since we do not sample in a simplistic sense.

Confidence intervals make precise sense when hypothetical repetitions of the experiment are meaningful: In repetitions, the method produces intervals that cover the true value in a fraction of cases that agrees with the degree of confidence. This interpretation might be extended slightly. The particular study might not be reproducible by drawing an independent sample. But, in studies where the assumed model indeed represents the uncertainties involved, the fraction of studies leading to their produced confidence intervals covering the true values (which might vary from study to study) will match the degree of confidence (which is kept fixed) in the long run.
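The repeated-sampling interpretation can be made concrete with a small simulation. This sketch uses the simplest textbook case, an interval for a normal mean with known standard deviation, rather than a regression coefficient; all parameter values are assumptions for the illustration.

```python
import random
from statistics import NormalDist

rng = random.Random(3)
true_mu, sigma, n, reps = 10.0, 2.0, 25, 2000

# Half-width of the 95% interval for the mean when sigma is known
z = NormalDist().inv_cdf(0.975)
half_width = z * sigma / n ** 0.5

covered = 0
for _ in range(reps):
    sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
    sample_mean = sum(sample) / n
    # The interval covers the truth if the mean lands within one half-width
    if abs(sample_mean - true_mu) <= half_width:
        covered += 1

coverage = covered / reps
print(f"fraction of intervals covering the true mean: {coverage:.3f}")
```

The observed fraction should be close to 0.95 because the simulated data are generated exactly as the interval's model assumes, which is precisely the condition questioned for the journal data.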

Repeated sampling makes no sense in the present study. The question is whether the assumed model, say (e), correctly represents the uncertainties involved. I have my doubts. There are, for example, reasons why some journals are privately published and others are society journals. The society journals tend to be more general in scope than the private ones. It is easier to carve out a market segment in a limited area, say labour economics or fisheries economics, and establish a private journal with a dominating position there, than in the more general areas. Perhaps such conditions are more important for the pricing than the covariates in (e).

The big question is rather: to what extent are our results externally valid? That is, to what extent do our results have validity for extrapolation in time, language area and field of science? It is hard to give a good answer here. The best approach might be to study the journals in a few other fields of science and see if the same pattern emerges.

APPENDIX

(Output based on Stata)

Figure 1. Price by number of pages for non-profit and private journals.

Figure 2. Scatter plots for logged variables. The plot for LC (on the y-axis) versus LN is, for example, found in row 3 and column 4. Non-profit journals.

Figure 3. Scatter plots for logged variables. Private journals.

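      Source |       SS       df       MS              Number of obs =     180
-------------+------------------------------           F(  2,   177) =   27.34
       Model |  36.8357611     2  18.4178806           Prob > F      =  0.0000
    Residual |  119.232662   177  .673630857           R-squared     =  0.2360
-------------+------------------------------           Adj R-squared =  0.2274
       Total |  156.068423   179  .871890631           Root MSE      =  .82075

------------------------------------------------------------------------------
    price LP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------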