GARCH Estimation by True Maximum Likelihood
J. Huston McCulloch
Ohio State University
July 31, 2001
GARCH Estimation by ML
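The GARCH(1,1) model
$$\epsilon_t = \sigma_t z_t, \qquad z_t \sim \text{i.i.d. } f(z), \qquad \sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \tag{1}$$
has an easily computed conditional likelihood, conditional on $\sigma_1$, of
$$\mathcal{L}(\omega, \alpha, \beta; \sigma_1) = \prod_{t=1}^{T} \frac{1}{\sigma_t} f\!\left(\frac{\epsilon_t}{\sigma_t}\right). \tag{2}$$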
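As a minimal sketch of how the conditional likelihood can be evaluated, the following Python function computes its logarithm under two assumptions that are not part of the text: that $f(z)$ is the standard normal density, and that the variance recursion takes the usual form $\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2$. The function name and the sample errors are hypothetical.

```python
import math

def garch_cond_loglik(eps, omega, alpha, beta, sigma1):
    """Log of the conditional likelihood given sigma_1 (Gaussian f assumed)."""
    var = sigma1 ** 2          # sigma_1^2, the conditioning value
    loglik = 0.0
    for e in eps:
        # log of (1/sigma_t) * f(eps_t / sigma_t), with f the N(0,1) density
        loglik += -0.5 * (math.log(2.0 * math.pi) + math.log(var) + e * e / var)
        # variance recursion: sigma_{t+1}^2 = omega + alpha*eps_t^2 + beta*sigma_t^2
        var = omega + alpha * e * e + beta * var
    return loglik

# Made-up errors, for illustration only
eps = [0.3, -1.2, 0.8, 0.1, -0.5]
print(garch_cond_loglik(eps, omega=0.1, alpha=0.1, beta=0.8, sigma1=1.0))
```

Working in logs throughout avoids forming the product of densities directly, which can underflow for long samples.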
There may be additional parameters, not listed, governing the shape of the density $f(z)$, and/or regression parameters if the errors are regression residuals.
Each scale value $\sigma_t$ (and therefore the initial value $\sigma_1$) has a common unconditional distribution with some precise p.d.f. $g(\sigma)$ and corresponding c.d.f. $G(\sigma)$. This is not a standard distribution, but it can be approximated numerically, given the parameters and $f(z)$, by iterating repeatedly on (1), using an appropriate transformation of $\sigma$ onto a finite interval. Even if there were a long run of zero errors, $\sigma_t^2$ could never fall below $\omega/(1-\beta)$, although there is some probability that it could come arbitrarily close to this value from above. The square root of $\omega/(1-\beta)$ is therefore the infimum of the support of $g(\sigma)$.[1]
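One rough way to get a feel for the unconditional distribution of the scale is to simulate it. The sketch below is not the iteration on a transformed grid described above; it swaps in a plain Monte Carlo simulation of the recursion and uses the sorted draws as an empirical stand-in for the inverse c.d.f. It assumes Gaussian $z_t$ and the usual recursion $\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2$; the function names, parameter values, and simulation sizes are all hypothetical.

```python
import math
import random

def simulate_sigmas(omega, alpha, beta, n_draws=20000, burn_in=1000, seed=0):
    """Simulate the scale process and return sorted draws of sigma_t."""
    rng = random.Random(seed)
    var = omega / (1.0 - beta)      # start at the lower bound of the support
    draws = []
    for t in range(burn_in + n_draws):
        if t >= burn_in:
            draws.append(math.sqrt(var))
        z = rng.gauss(0.0, 1.0)     # assumption: standard normal z_t
        # eps_t^2 = sigma_t^2 * z_t^2, so the recursion becomes:
        var = omega + alpha * var * z * z + beta * var
    draws.sort()
    return draws

def empirical_quantile(sorted_draws, y):
    """Crude approximation to the inverse c.d.f. at y, for 0 <= y < 1."""
    idx = min(int(y * len(sorted_draws)), len(sorted_draws) - 1)
    return sorted_draws[idx]

sig = simulate_sigmas(omega=0.1, alpha=0.1, beta=0.8)
floor = math.sqrt(0.1 / (1.0 - 0.8))   # infimum of the support
print(floor, sig[0], empirical_quantile(sig, 0.5))
```

The smallest simulated scale should sit just above the lower bound, illustrating that the bound is approached but not attained. Successive draws are serially dependent, so this is only a convenience for exploring the shape of the distribution, not a substitute for a careful numerical approximation.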
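Given $g(\sigma)$, the exact likelihood of the errors may be found simply by taking the expectation of (2) over $g(\sigma_1)$:
$$\mathcal{L}(\omega, \alpha, \beta) = \int_{\sqrt{\omega/(1-\beta)}}^{\infty} \mathcal{L}(\omega, \alpha, \beta; \sigma_1)\, g(\sigma_1)\, d\sigma_1. \tag{3}$$
The improper integral in (3) may be simplified to a proper integral by means of the substitution $y = G(\sigma_1)$, as follows:[2]
$$\mathcal{L}(\omega, \alpha, \beta) = \int_0^1 \mathcal{L}(\omega, \alpha, \beta; G^{-1}(y))\, dy. \tag{4}$$
Note that as $\sigma_1 \to \infty$, the initial terms in (2) decline to 0, so that
$$\mathcal{L}(\omega, \alpha, \beta; G^{-1}(1)) = 0. \tag{5}$$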
Comparison to Alternative Initializations
Most practitioners simply insert some value of $\sigma_1$ into (2) and appeal to the fact that maximizing the resulting function over the GARCH parameters (and any other parameters that may be present) will give Quasi Maximum Likelihood estimates whose dependence on $\sigma_1$ will die out as the sample size grows, and hence will share the consistency of true ML estimates. However, with finite sample sizes, the choice of $\sigma_1$ can have an important effect on the parameter estimates and can alter the distribution of test statistics. Even asymptotically, it is not clear (at least not to me) that the distribution of QML test statistics will not be sensitive to the choice of $\sigma_1$.
McCulloch (1985) introduced a conditionally symmetric stable IGARCH(1,1) model to characterize interest rate volatility. This study maximized the conditional likelihood (2), basing $\sigma_1$ on the first 12 monthly observations on $\epsilon_t$, and otherwise discarding these observations.[3] This procedure of course makes less than full use of the available data. Bollerslev (1986, p. 315, n. 4) likewise conditions on pre-sample values.
Engle and Bollerslev (1986) set
$$\sigma_1^2 = \frac{1}{T} \sum_{t=1}^{T} \epsilon_t^2. \tag{6}$$
EViews 4.0 (2000, p. 385) similarly “backcasts” $\sigma_1^2$, using a geometrically declining weighted average of the squared errors, with an arbitrary decay factor of 0.7. By thus conditioning on in-sample values, both these approaches technically violate the joint probability decomposition that leads to (2). Such an initialization will not adequately penalize parameter values that do not fit the data well, since $\sigma_1$ is being determined by the data rather than by the parameters. “Backcasting” will overfit particularly well, since it accommodates the volatility of the first several errors quite closely, regardless of the parameters being considered.
Bidarkota and McCulloch (1998) treat $\sigma_1$ as an additional parameter to be found by maximizing (2). This clearly overstates the value of the integral in (4) and therefore the likelihood, since it replaces the integrand by its maximal value. Again, the overstatement will generally be at its worst for bad parameter values, since $\sigma_1$, whose distribution is in fact determined precisely by the parameters, is instead being set to conform to the data.
Andrews (2001) sets
$$\sigma_1^2 = \frac{\omega}{1-\beta}. \tag{7}$$
This choice almost certainly understates the likelihood when $\alpha > 0$, since it replaces the integrand in (4) by the relatively small value it takes on at $y = 0$. His paper is concerned with testing the null hypothesis of no GARCH, which he correctly notes is the single restriction $\alpha = 0$ in (1). Under this null hypothesis, $\beta$ becomes unidentified, and $g(\sigma)$ collapses on $\sqrt{\omega/(1-\beta)}$, so that
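$$\mathcal{L}(\omega, 0, \beta) = \mathcal{L}\!\left(\omega, 0, \beta; \sqrt{\omega/(1-\beta)}\right), \tag{8}$$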
and Andrews’ choice is appropriate. Under the alternative hypothesis $\alpha > 0$ and assuming $E z_t^2 = 1$, however, a more appropriate single choice, short of evaluating (4), would be
$$\sigma_1^2 = \frac{\omega}{1-\alpha-\beta}. \tag{9}$$
This choice does not clearly over- or under-state the likelihood. It does break down in the IGARCH case, but that is not an issue in testing for the complete absence of GARCH as in Andrews (2001). A QML test based on (7) probably has reduced power in comparison to ML or QML based on (9), since it does not allow the objective function to respond adequately to deviations from the null. If one’s goal is to estimate GARCH parameters by QML without excluding the IGARCH or trans-IGARCH cases, an appropriate improvement over (7) short of computing (4) would be to set $\sigma_1$ equal to the median or mode of $g(\sigma)$.[4]
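To make the difference between the two single-value initializations concrete, the short sketch below computes both for one set of parameter values. It takes (7) to be the lower bound of the support, $\sigma_1^2 = \omega/(1-\beta)$, and (9) to be the unconditional variance, $\sigma_1^2 = \omega/(1-\alpha-\beta)$; these readings, the function names, and the parameter values are assumptions made for illustration.

```python
import math

def sigma1_lower_bound(omega, beta):
    """Initialization at the infimum of the support: sqrt(omega / (1 - beta))."""
    return math.sqrt(omega / (1.0 - beta))

def sigma1_unconditional(omega, alpha, beta):
    """Unconditional standard deviation; defined only when alpha + beta < 1."""
    if alpha + beta >= 1.0:
        raise ValueError("unconditional variance is not finite when alpha + beta >= 1")
    return math.sqrt(omega / (1.0 - alpha - beta))

omega, alpha, beta = 0.1, 0.1, 0.8   # hypothetical parameter values
print(sigma1_lower_bound(omega, beta) ** 2)           # about 0.5
print(sigma1_unconditional(omega, alpha, beta) ** 2)  # about 1.0
```

With these values the lower-bound choice starts the variance at roughly half its unconditional level, and the gap widens as $\alpha$ grows relative to $1-\beta$. The explicit error in the second function mirrors the breakdown of the unconditional-variance choice in the IGARCH case.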
Computational Considerations
Equation (4) may be numerically integrated by any simple method such as Simpson’s rule. If Simpson’s rule is computed with n = 4m+1 equally spaced y-values, including the end points 0 and 1, where m is a positive integer, its precision may be roughly estimated by recomputing the integral using 2m+1 of these values, and assuming that the remaining error is less in absolute value than the difference in these two results. If the estimated error in the log likelihood is less than 0.01, the precision is more than adequate for likelihood comparison purposes. A value of m as small as 2 (yielding n = 9) might be sufficient, since $\sigma_1$ has little effect on any but the first few terms of (2). If so, true ML would take only about n-1 = 8 times as long as QML using one of the alternative initializations, since by (5) the integrand vanishes at y = 1 and need not be evaluated there. With today’s computers, this is not a significant constraint for most problems.[5]
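The error-estimation scheme just described can be sketched in a few lines of Python. The code evaluates an integrand on 4m+1 equally spaced points in [0, 1], applies Simpson's rule to all of them and again to every second point (2m+1 values), and reports the absolute difference as a rough error bound. The function names are hypothetical, and the exponential integrand is only a stand-in; in the application the integrand would be the conditional likelihood evaluated at the scale corresponding to each y.

```python
import math

def simpson(values, h):
    """Composite Simpson's rule for an odd number of equally spaced values."""
    n = len(values)
    if n < 3 or n % 2 == 0:
        raise ValueError("Simpson's rule needs an odd number of points >= 3")
    total = values[0] + values[-1]
    total += 4.0 * sum(values[1:-1:2])   # odd-indexed interior points
    total += 2.0 * sum(values[2:-1:2])   # even-indexed interior points
    return total * h / 3.0

def simpson_with_error(func, m):
    """Integrate func over [0, 1] on 4m+1 points; estimate error from 2m+1."""
    n = 4 * m + 1
    h = 1.0 / (n - 1)
    vals = [func(i * h) for i in range(n)]
    fine = simpson(vals, h)
    coarse = simpson(vals[::2], 2.0 * h)  # reuses every second evaluation
    return fine, abs(fine - coarse)

# Stand-in integrand for illustration; exact integral is e - 1
estimate, err = simpson_with_error(math.exp, m=2)
print(estimate, err)
```

Because the coarse rule reuses function values already computed, the error estimate costs no additional integrand evaluations.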
The exact likelihood could in principle be computed using the Sorenson-Alspach/Kitagawa filtering equations given by Harvey (1989, pp. 163-4) and employed by Bidarkota and McCulloch (1998) to model the trend of U.S. inflation. However, this would require laborious numerical integrals at a large number of points for each value of t, alternately computing the prior distribution for $\sigma_t$ given $\epsilon_1, \ldots, \epsilon_{t-1}$ and the posterior distribution for $\sigma_t$ given $\epsilon_1, \ldots, \epsilon_t$, just to get the likelihood for a single value of the parameters. Equation (4) above gives the same result with far less computation.
A certain precaution should be exercised in computing (4). The log likelihood is ordinarily a manageable number, but (4) requires that the likelihood itself be integrated, and the likelihood itself can easily be either an underflow or an overflow. To circumvent this problem, let $L_i = \log \mathcal{L}(\omega, \alpha, \beta; G^{-1}(y_i))$ be the log likelihood at the selected points $y_i$ at which the integrand is to be evaluated,[*] and let $L_{\max}$ be the maximum of these $L_i$. Define $L_i^* = L_i - L_{\max}$ and $\mathcal{L}_i^* = \exp(L_i^*)$. The maximum value of $\mathcal{L}_i^*$ is unity, and any underflows may simply be treated as zeros. Integrate the $\mathcal{L}_i^*$ numerically as above to obtain $\mathcal{L}^*$. The log of the likelihood as given by (4) is then $L = \log(\mathcal{L}^*) + L_{\max}$.
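The rescaling precaution can be illustrated with a short Python sketch. It takes log likelihood values on an equally spaced grid, subtracts their maximum before exponentiating, integrates the rescaled values by Simpson's rule, and adds the maximum back in logs. The function name and the input values are hypothetical; a value of minus infinity is used for the end point where the integrand is zero.

```python
import math

def log_integral_from_logs(log_vals, h):
    """Log of the Simpson integral of exp(log_vals) on an equally spaced grid.

    Assumes an odd number of points and at least one finite log value.
    """
    l_max = max(log_vals)
    # Rescaled likelihoods: the largest is exactly 1, underflows become 0.0
    scaled = [math.exp(l - l_max) for l in log_vals]
    total = scaled[0] + scaled[-1]
    total += 4.0 * sum(scaled[1:-1:2]) + 2.0 * sum(scaled[2:-1:2])
    return math.log(total * h / 3.0) + l_max

# Log likelihoods near -1000 would underflow if exponentiated directly
log_vals = [-1000.0, -1000.0, -1000.0, -1000.0, float("-inf")]
print(math.exp(-1000.0))                       # 0.0: direct evaluation underflows
print(log_integral_from_logs(log_vals, 0.25))  # finite, close to -1000
```

Exponentiating any of these values directly gives zero, so the naive integral would have an undefined logarithm, whereas the rescaled computation returns a usable log likelihood.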
References
Andrews, Donald W. K., “Testing when a Parameter is on the Boundary of the Maintained Hypothesis.” Econometrica 69 (2001): 683-734.
Bidarkota, Prasad V., and J. Huston McCulloch, “Optimal Univariate Inflation Forecasting with Symmetric Stable Shocks.” Journal of Applied Econometrics 13 (1998): 659-70.
Bollerslev, Tim, “Generalized Autoregressive Conditional Heteroskedasticity.” Journal of Econometrics 31 (1986): 307-27.
Engle, Robert F., and Tim Bollerslev, “Modelling the Persistence of Conditional Variances.” Econometric Reviews 5 (1986): 1-50.
EViews 4 Command and Programming Reference. Quantitative Micro Software, Irvine CA, 2000.
Harvey, Andrew C., Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge, 1989.
McCulloch, J. Huston, “Interest-Risk Sensitive Deposit Insurance Premia.” Journal of Banking and Finance 9 (1985): 137-56.
Nelson, Daniel B., “Stationarity and Persistence in the GARCH(1,1) Model.” Econometric Theory 6 (1990): 318-34.
[1] If the process truly began at t = 1, $\sigma_1$ could conceivably take on any positive value. It is assumed here that in fact the process has already been going on indefinitely at t = 1, even though it is not observed until t = 1.
[2] Equation (4) is an example of what might be called “GARCHian integration,” by analogy to “Gaussian integration.” In the latter, the Gaussian c.d.f. is used as a transformation to simplify an expectation under a Gaussian p.d.f.
[3] Like Engle and Bollerslev (1986), McCulloch (1985) erroneously omitted the constant term in (1) in the mistaken belief that this is necessary to prevent the process from exploding when $\alpha + \beta = 1$. See Nelson (1990) for the true stationarity conditions in the conditionally Gaussian cases. In order to keep expectations finite, McCulloch (1985) also replaced the squares in (1) with the first powers of absolute values, but in retrospect this was not necessary. The symmetric stable distributions include the Gaussian as a special case, so that the GARCH-normal model is subsumed within the GARCH-stable class.
[4] If the variance of $\epsilon_t$ is infinite, as in the stable cases considered by McCulloch (1985) and Bidarkota and McCulloch (1998), the rationale for (9) disappears, but the mode or median of the unconditional distribution would still be appropriate.
[5] I see no particular advantage to using Legendre integration rather than Simpson’s rule, since the integrand is unlikely to be globally well approximated by a single polynomial. Simpson’s rule instead approximates it locally by globally unrelated quadratics.
[*] Typographic note: The symbol $\mathcal{L}$ is a script upper case L, which may not show up correctly if the reader’s computer does not have the Lucida Casual font loaded.