Journal of Statistical Planning and Inference 138 (2008) 1967–1980
Monte Carlo integration with Markov chain
Zhiqiang Tan
Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, 615 North Wolfe Street, Baltimore, MD 21205, USA
Received 20 April 2006; received in revised form 25 April 2007; accepted 18 July 2007
Available online 15 August 2007
Abstract
There are two conceptually distinct tasks in Markov chain Monte Carlo (MCMC): a sampler is designed for simulating a Markov chain and then an estimator is constructed on the Markov chain for computing integrals and expectations. In this article, we aim to address the second task by extending the likelihood approach of Kong et al. for Monte Carlo integration. We consider a general Markov chain scheme and use partial likelihood for estimation. Basically, the Markov chain scheme is treated as a random design and a stratified estimator is defined for the baseline measure. Further, we propose useful techniques including subsampling, regulation, and amplification for achieving overall computational efficiency. Finally, we introduce approximate variance estimators for the point estimators. The method can yield substantially improved accuracy compared with Chib's estimator and the crude Monte Carlo estimator, as illustrated with three examples.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Gibbs sampling; Importance sampling; Markov chain Monte Carlo; Partial likelihood; Stratification; Variance estimation
1. Introduction
Markov chain Monte Carlo (MCMC) has been extensively used in statistics and other scientific fields. A key idea is to simulate a Markov chain rather than a simple random sample for Monte Carlo integration. There are two conceptually distinct tasks: a sampler is designed for simulating a Markov chain converging to a target distribution, and then an estimator is constructed on the Markov chain for computing integrals and expectations. The first task has been actively researched, for example in finding effective sampling algorithms and diagnosing convergence in various MCMC applications (e.g. Gilks et al., 1996; Liu, 2001). In this article, we shall be concerned with the second task: making efficient inference given a Markov chain.
Suppose that q(x) is a nonnegative function on a state space X and its integral
$$Z = \int q(x)\,d\mu_0$$
is analytically intractable with respect to a baseline measure $\mu_0$. An MCMC algorithm can be applied to simulate a Markov chain $(x_1,\ldots,x_n)$ converging to the probability distribution with density
$$p(x) = \frac{q(x)}{Z},$$
doi:10.1016/j.jspi.2007.07.013
without requiring the value of its normalizing constant Z. Then a common approach is that, if n is sufficiently large, these points are used as an approximate and dependent sample from the distribution p(x) for Monte Carlo integration. For example, the expectation $E_p(\phi)$ of a function $\phi(x)$ with respect to p(x) can be estimated by the sample average or the crude Monte Carlo estimator
$$\bar{E}(\phi) = \frac{1}{n}\sum_{i=1}^n \phi(x_i).$$
However, there are two challenging problems for this inferential approach.
First, the Monte Carlo variability of $\bar{E}(\phi)$ is generally underestimated by the sample variance
$$\frac{1}{n}\sum_{i=1}^n [\phi(x_i) - \bar{E}(\phi)]^2$$
divided by n, and the amount of underestimation can sometimes be substantial due to serial dependency. Specialized methods, such as running multiple chains or using batch means, are needed to assess the Monte Carlo error (e.g. Gelman and Rubin, 1992; Geyer, 1992).
Second, the crude Monte Carlo estimator is not directly applicable to the normalizing constant Z, which may be of interest and part of the computational goal. For example, if q(x) is the product of a prior and a likelihood in Bayesian inference, Z is the predictive density of the data. As a result, various methods have been proposed for computing normalizing constants, but most of them involve a matching density or additional simulation (e.g. DiCiccio et al., 1997; Gelman and Meng, 1998).
We consider a general Markov chain scheme and develop a new method for estimating simultaneously the normalizing constant Z and the expectation $E_p(\phi)$. Suppose that a Markov chain $[(\xi_1, x_1),\ldots,(\xi_n, x_n)]$ is generated as follows.
General Markov chain scheme: at each time t,
update $\xi_t$ given $(\xi_{t-1}, x_{t-1})$ and
sample $x_t$ from $p(\cdot\,; \xi_t)$.
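For concreteness, the scheme can be sketched in code. Below is a minimal Python sketch (ours, not the paper's; the paper's computations use MATLAB), instantiated with the bivariate normal example that appears later in Section 2.2; the function name simulate_chain and the choice $\rho = 0.9$ are illustrative assumptions.

```python
# Minimal sketch of the general Markov chain scheme, instantiated with
# scheme (i) of Section 2.2:
#   update  xi_t | x_{t-1} ~ N(rho * x_{t-1}, 1 - rho^2),
#   sample  x_t  | xi_t    ~ N(rho * xi_t,    1 - rho^2) = p(. ; xi_t).
# Only p(. ; xi) needs to be known in closed form for estimation; how xi_t
# is updated matters for simulation but not for the estimators below.
import numpy as np

def simulate_chain(n, rho=0.9, seed=0):
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho ** 2)
    xis, xs = np.empty(n), np.empty(n)
    x_prev = 0.0
    for t in range(n):
        xis[t] = rng.normal(rho * x_prev, s)  # update xi_t given (xi_{t-1}, x_{t-1})
        xs[t] = rng.normal(rho * xis[t], s)   # sample x_t from p(. ; xi_t)
        x_prev = xs[t]
    return xis, xs
```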
It is helpful to think of $\xi_t$ as an index and $x_t$ as a draw given the index. Denote by $\Gamma$ the index set. The details of updating $\xi_t$ are not needed for estimation, even though these details are essential to simulation. In contrast, the transition density $p(\cdot\,; \xi)$ is assumed to be completely known on X with respect to $\mu_0$ for $\xi \in \Gamma$. This scheme is very flexible and encompasses the following individual cases:
(i) a Markov chain $(x_1, x_2, \ldots)$ with transition density $p(\cdot\,; x_{t-1})$, by letting $\xi_t = x_{t-1}$.
(ii) a subsampled chain $(x_1, x_{b+1}, x_{2b+1}, \ldots)$, by letting $(\xi_t, x_t)$ be $(x_{tb}, x_{tb+1})$ in the original Markov chain.
(iii) Gibbs sampling, where the chain $(\xi_t, x_t)$ has a joint stationary distribution $p(\xi, x)$, and $\xi_t$ is sampled from the conditional distribution $p(\xi; x_{t-1})$ and $x_t$ from the conditional distribution $p(x; \xi_t)$.
(iv) generalized Gibbs sampling, where a stochastic transformation $g_t$ of $\xi_t$ is inserted after $\xi_t$ is sampled and then $x_t$ is sampled from $p(\cdot\,; g_t(\xi_t))$; the index $\xi_t$ is expanded to be $(\xi_t, g_t)$.
(v) Metropolis–Hastings sampling, by identifying the current state as $\xi_t$ and the proposal as $x_t$, even though the commonly referred Metropolis–Hastings chain, that is $(\xi_1, \ldots, \xi_n)$, does not admit a transition density.
Case (i) is conceptually important but may be restrictive in practice, as we shall explain. In case (ii), the subsampled chain $(x_1, x_{b+1}, x_{2b+1}, \ldots)$ is Markovian, but its transition density is complicated, being the b-step convolution of the original transition density. In case (iii), the joint chain is Markovian on $\Gamma \times X$ with transition density $p(x; \xi)\,p(\xi; x_{t-1})$, but integrals of interest are defined on X. The marginal chain $(x_1, \ldots, x_n)$ is also Markovian, but its transition density involves integrating $\xi$ out of $p(x; \xi)\,p(\xi; x_{t-1})$, which is typically difficult.
A familiar example of case (iii) is data augmentation in Bayesian computation, where x is a parameter, $\xi$ is a latent variable, and q(x) is the product of a prior and a likelihood (Gelfand and Smith, 1990; Tanner and Wong, 1987). Case (iv) is represented by a recent development in data augmentation, where $\xi$ includes not only a latent variable but also a working parameter (Liu and Sabatti, 2000; Liu and Wu, 1999; van Dyk and Meng, 2001).
The general description of the scheme allows the situation where $x_t$ is sampled component by component, such as
sample $x_t^1$ from $p^1(x^1; \xi_t)$ and
sample $x_t^2$ from $p^{2|1}(x^2; x_t^1, \xi_t)$.
Then $p(x; \xi_t)$ is the product $p^1(x^1; \xi_t)\, p^{2|1}(x^2; x^1, \xi_t)$. As a result, case (iii) or (iv) is not restricted to two-component Gibbs sampling as it appears to be.
Assume that the sequence $(x_1, \ldots, x_n)$ converges to a probability distribution as $n \to \infty$. Denote by $p_*(x)$ the stationary density with respect to $\mu_0$. It is not necessary that $p_*(x)$ be identical to the target density p(x). The sequence $(x_1, \ldots, x_n)$ typically converges to $p_*(x) = p(x)$ in Gibbs sampling and its generalization [cases (iii) and (iv)]. In contrast, this sequence does not converge to p(x), i.e. $p_*(x) \neq p(x)$, in Metropolis–Hastings sampling, where $(\xi_1, \ldots, \xi_n)$ converges to p(x) [case (v)]. We develop a general method in Section 2 and illustrate the application to Gibbs sampling and its variation in Section 3. Applications of the general method to rejection sampling and Metropolis–Hastings sampling are presented in Tan (2006).
We take the likelihood approach of Kong et al. (2003) in developing the method; see Tan (2004) for the optimality of the likelihood approach in two situations of independence sampling. In that approach, the baseline measure is treated as the parameter in a model and estimated as a discrete measure by maximum likelihood. Consequently, integrals of interest are estimated as finite sums by substituting the estimated measure. Kong et al. (2003) considered independence sampling and the simplest, individual case (i) of Markov chain sampling. We extend the likelihood approach to the general Markov chain scheme by using partial likelihood (Cox, 1975). The approximate maximum likelihood estimator of the baseline measure is
$$\tilde\mu(\{x\}) = \frac{P(\{x\})}{n^{-1}\sum_{j=1}^n p(x; \xi_j)},$$
where P is the empirical distribution placing mass $n^{-1}$ at each of the points $x_1, \ldots, x_n$. The integral $Z = \int q(x)\,d\mu_0$ is estimated by
$$\tilde Z = \int q(x)\,d\tilde\mu = \sum_{i=1}^n \frac{q(x_i)}{\sum_{j=1}^n p(x_i; \xi_j)}.$$
Note that the same estimator also holds for a real-valued integrand q(x). The expectation $E_p(\phi) = \int \phi(x) q(x)\,d\mu_0 \big/ \int q(x)\,d\mu_0$ is estimated by
$$\tilde E(\phi) = \frac{\int \phi(x) q(x)\,d\tilde\mu}{\int q(x)\,d\tilde\mu} = \sum_{i=1}^n \frac{\phi(x_i)\, q(x_i)}{\sum_{j=1}^n p(x_i; \xi_j)} \bigg/ \sum_{i=1}^n \frac{q(x_i)}{\sum_{j=1}^n p(x_i; \xi_j)}.$$
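To fix ideas, here is a sketch (our illustrative Python; basic_estimators, p_dens, and the N(0, 1) toy target are assumed names and choices, not the paper's) that computes $\tilde Z$ and $\tilde E(\phi)$ from a simulated chain via an $n \times n$ matrix of transition-density evaluations.

```python
# Sketch of the basic estimators.  q and phi are vectorized functions;
# p_dens(x, xi) evaluates the known transition density p(x; xi) and must
# broadcast over numpy arrays.
import numpy as np

def basic_estimators(xs, xis, q, phi, p_dens):
    dens = p_dens(xs[:, None], xis[None, :])  # dens[i, j] = p(x_i; xi_j), n^2 terms
    denom = dens.sum(axis=1)                  # sum_j p(x_i; xi_j)
    w = q(xs) / denom                         # mass that q d(mu tilde) puts on x_i
    Z_tilde = w.sum()
    E_tilde = np.sum(phi(xs) * w) / Z_tilde   # weighted Monte Carlo average
    return Z_tilde, E_tilde

# Toy check: target q(x) = N(0,1) density, so Z = 1 and E_p(x) = 0.
rho = 0.9
s = np.sqrt(1.0 - rho ** 2)
rng = np.random.default_rng(0)
xis = rng.normal(0.0, 1.0, 1000)              # scheme (ii) of Section 2.2
xs = rng.normal(rho * xis, s)
q = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
p_dens = lambda x, xi: np.exp(-0.5 * ((x - rho * xi) / s) ** 2) / (s * np.sqrt(2 * np.pi))
Z_tilde, E_tilde = basic_estimators(xs, xis, q, lambda x: x, p_dens)
```

The weights w here are proportional to $q(x_i)/[n^{-1}\sum_j p(x_i; \xi_j)]$, matching the weighted Monte Carlo interpretation given in Section 2.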
Further, we modify the basic estimator $\tilde\mu$ and propose a subsampled estimator $\tilde\mu_b$, a regulated estimator $\tilde\mu_\delta$, an amplified estimator $\tilde\mu_a$, and their combined versions. Finally, we introduce approximate variance estimators for the point estimators.
Our method can require less total computational time (for simulating the Markov chain and computing the estimators) than statistically inefficient estimation with a brute-force increase of sample size to achieve the same degree of accuracy. In three examples, we find that the basic estimator $\log \tilde Z$ has smaller variance than Chib's estimator by a factor ⩾ 100, and the basic estimator $\tilde E(\phi)$ has smaller variance than the crude Monte Carlo estimator by a factor of 1–100. Subsampling helps to reduce computational cost, while regulation or amplification helps to reinforce statistical efficiency, so that our method achieves overall computational efficiency. We also find that the approximate variance estimators agree with the empirical variances of the corresponding point estimators in all our examples.
2. Markov chain simulation
For the simplest case (i), i.e. $\xi_t = x_{t-1}$, Kong et al.'s (2003) model assumes that $x_t$ conditional on $x_{t-1}$ has distribution
$$\frac{p(\cdot\,; x_{t-1})\,d\mu}{\int p(x; x_{t-1})\,d\mu},$$
where $\mu$ is a nonnegative measure on X. The likelihood is
$$\prod_{i=1}^n \left[ \frac{p(x_i; x_{i-1})\,\mu(\{x_i\})}{\int p(x; x_{i-1})\,d\mu} \right].$$
For the general Markov chain scheme, even though the sequence $[(\xi_1, x_1), \ldots, (\xi_n, x_n)]$ is a Markov chain, we consider the model only specifying that $x_t$ conditional on $\xi_t$ has distribution
$$\frac{p(\cdot\,; \xi_t)\,d\mu}{\int p(x; \xi_t)\,d\mu},$$
where $\mu$ is a nonnegative measure on X. The partial likelihood (Cox, 1975) is
$$\prod_{i=1}^n \left[ \frac{p(x_i; \xi_i)\,\mu(\{x_i\})}{\int p(x; \xi_i)\,d\mu} \right].$$
If $\xi_t$ is deterministic given $x_{t-1}$, for example $\xi_t = x_{t-1}$, this likelihood coincides with the full likelihood.
Under similar conditions of support and connectivity as in Vardi (1985), the maximum likelihood estimate $\hat\mu$ has finite support $\{x_1, \ldots, x_n\}$ and satisfies
$$\hat\mu(\{x\}) = \frac{P(\{x\})}{n^{-1}\sum_{j=1}^n Z^{-1}(\xi_j)\, p(x; \xi_j)},$$
where $Z(\xi_j) = \int p(x; \xi_j)\,d\hat\mu$. This equation involves n unknowns, $\hat\mu(\{x_1\}), \ldots, \hat\mu(\{x_n\})$, and practically defies numerical solution for large n. However, the difficulty can be overcome by substituting the true value $Z(\xi_j) = \int p(x; \xi_j)\,d\mu_0$, known to be one, in the above likelihood equation. The resulting estimator is given by $\tilde\mu$ in Section 1.
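To spell out where the likelihood equation comes from (a short derivation we add for completeness; the mass-point parametrization $\mu_k = \mu(\{x_k\})$ is ours), differentiate the log partial likelihood with $\mu$ supported on the sampled points:
$$\ell(\mu) = \sum_{i=1}^n \log p(x_i; \xi_i) + \sum_{i=1}^n \log \mu_i - \sum_{i=1}^n \log\Big(\sum_{l=1}^n \mu_l\, p(x_l; \xi_i)\Big),$$
$$\frac{\partial \ell}{\partial \mu_k} = \frac{1}{\mu_k} - \sum_{j=1}^n \frac{p(x_k; \xi_j)}{Z(\xi_j)} = 0 \quad\Longrightarrow\quad \mu_k = \bigg(\sum_{j=1}^n Z^{-1}(\xi_j)\, p(x_k; \xi_j)\bigg)^{-1},$$
and substituting $Z(\xi_j) \equiv 1$ gives $\tilde\mu(\{x_k\}) = 1\big/\sum_{j=1}^n p(x_k; \xi_j)$, the estimator of Section 1.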
In retrospect, the Markov chain scheme basically provides a random design: an index $\xi_i$ is stochastically selected and then a draw $x_i$ is made from $p(\cdot\,; \xi_i)$ for $1 \le i \le n$. The estimator $\tilde Z$ is a stratified importance sampling estimator using one observation $x_i$ per distribution $p(\cdot\,; \xi_i)$. The estimator $\tilde E(\phi)$ is a weighted Monte Carlo estimator: the observations $x_i$ have weights proportional to
$$\frac{q(x_i)}{n^{-1}\sum_{j=1}^n p(x_i; \xi_j)}.$$
These insights are important in themselves, independent of the model formulation, and are relevant to most of the development here.
Now consider Gibbs sampling [case (iii)], where the sequence $(x_1, \ldots, x_n)$ converges to the target distribution p(x). By the detailed balance equation, it follows that the average of successive transition densities $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ converges to p(x) pointwise. This result suggests that the estimator $\tilde Z$ converges faster than at the standard rate $n^{-1/2}$, because $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ serves as the stratified importance sampling density and is asymptotically proportional to the integrand q(x). This super efficiency for estimating the normalizing constant was observed in Kong et al. (2003) and in our simulation studies (e.g. Table 1).
Moreover, the convergence result implies that the ratio
$$\frac{q(x)}{n^{-1}\sum_{j=1}^n p(x; \xi_j)}$$
converges to the normalizing constant Z at any fixed $x \in X$. For Gibbs sampling, Chib (1995) proposed evaluating the above ratio at a high-density point to estimate Z. Therefore, the estimator $\tilde Z$ appears to be an average of Chib's ratios
Table 1
Comparison of estimators of log Z

              n = 500            n = 1000           n = 2500           n = 5000
              Chib      Lik      Chib      Lik      Chib      Lik      Chib      Lik
Time ratio    <.01      2.2      <.01      3.5      <.01      7.8      <.01      15.8
Bias          −.0042    −.013    −.0024    −.0063   −.00093   −.0025   −.00082   −.0012
Std Dev       .113      .0108    .0807     .00459   .0513     .00167   .0363     .000803
Sqrt MSE      .113      .0167    .0807     .00777   .0513     .00299   .0363     .00148
Approx Err    NA        .0124    NA        .00607   NA        .00244   NA        .00125

Note: Time ratio is the ratio of CPU seconds for integral evaluation against Gibbs sampling. Bias, Std Dev, and Sqrt MSE are the bias, standard deviation, and square root of the mean squared error of the point estimates, and Approx Err is the square root of the mean of the variance estimates. The results are based on MATLAB programming.
at the sampled points $x_i$. This relationship holds whether $x_t$ is sampled in one block or component by component. In the latter case, $p(x; \xi_t)$ is given by the product of componentwise transition densities (see Section 1).
For generalized Gibbs sampling [case (iv)], we conjecture that the average of densities $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ still converges to p(x) pointwise. A formal proof is not easily available, due to the fact that $[(\xi_1, x_1), \ldots, (\xi_n, x_n)]$ may be a non-positive-recurrent Markov chain even though $(x_1, \ldots, x_n)$ is a positive recurrent subchain (van Dyk and Meng, 2001). In this case, Chib's estimator remains to be formally studied, but it is empirically investigated and compared with the estimator $\tilde Z$ in our simulations (Section 3.2.2).
For Metropolis–Hastings sampling, the average of densities $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ does not generally converge to p(x), because $(\xi_1, \ldots, \xi_n)$ converges to p(x) [case (v)]. In this case, Chib's estimator is not valid, but the estimator $\tilde Z$ remains appropriate (see Tan, 2006).
2.1. Subsampling
Implementation of our method centers on computation of $\tilde\mu$, which involves $n^2$ evaluations of $p(x_i; \xi_j)$ for $1 \le i, j \le n$. This part is computationally demanding for large n, even though each individual $p(x_i; \xi_j)$ is easy to evaluate. Therefore, we propose a subsampled estimator to reduce the computational cost.
Specifically, divide the Markov chain into m blocks, each of length b (n = bm). For $1 \le i \le b$, the subsampled sequence $[(\xi_i, x_i), (\xi_{b+i}, x_{b+i}), \ldots, (\xi_{(m-1)b+i}, x_{(m-1)b+i})]$ fits the setup in Section 1: $\xi_{b+i}$ is stochastically mapped from $(\xi_i, x_i)$ and $x_{b+i}$ is sampled from $p(\cdot\,; \xi_{b+i})$, and so on. Applying $\tilde\mu$ to this sequence, we obtain
$$\tilde\mu_{\mathrm{row}\ i}(\{x\}) = \frac{P_{\mathrm{row}\ i}(\{x\})}{m^{-1}\sum_{j=1}^m p(x; \xi_{(j-1)b+i})},$$
where $P_{\mathrm{row}\ i}$ is the empirical distribution on the ith subsampled sequence. To use the whole sequence, we take the average
$$\tilde\mu_b(\{x\})\Big|_{x=x_k} = \frac{1}{b}\sum_{i=1}^b \tilde\mu_{\mathrm{row}\ i}(\{x\})\Big|_{x=x_k} = \frac{P(\{x\})}{m^{-1}\sum_{j \equiv k\ (\mathrm{mod}\ b)} p(x; \xi_j)},$$
where the sum in the denominator is taken over all $1 \le j \le n$ congruent with k modulo b if $x = x_k$. Computation of $\tilde\mu_b$ involves $bm^2$ evaluations of transition densities, only a fraction 1/b of those for computation of $\tilde\mu$. The normalizing constant Z and the expectation $E_p(\phi)$ are then estimated by substituting $\tilde\mu_b$ in their definitions.
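In code (continuing our illustrative Python conventions from Section 1), the subsampled estimator only ever sums transition densities over indices congruent with k modulo b; a minimal sketch:

```python
# Sketch of the subsampled estimator of Z.  With 0-based indexing,
#   mu_b({x_k}) = (1/n) / [ m^{-1} * sum_{j = k (mod b)} p(x_k; xi_j) ]
#               = 1 / ( b * sum_{j = k (mod b)} p(x_k; xi_j) ).
import numpy as np

def subsampled_Z(xs, xis, q, p_dens, b):
    n = len(xs)
    assert n % b == 0, "n = b * m, with m blocks of length b"
    total = 0.0
    for k in range(n):
        js = np.arange(k % b, n, b)            # the m indices congruent with k mod b
        denom = p_dens(xs[k], xis[js]).sum()   # only m density evaluations per point
        total += q(xs[k]) / (b * denom)
    return total
```

With b = 1 this reduces to $\tilde Z$, and the total cost is $nm = bm^2$ density evaluations, as stated above.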
MacEachern et al. (2003) suggested the estimator $\tilde\mu_{\mathrm{row}\ i}$ from a subsample and compared it with $\tilde\mu$ from a full sample, both of length m. With b = 10, they reported a 10–43% reduction in standard deviation for estimating the normalizing constant in Kong et al.'s (2003) probit regression example. In contrast, we find that subsampling is more helpful for reducing bias than variance, and aggressive subsampling may inflate the variance; see Sections 2.2 and 3.
[Two boxplot panels, titled Scheme (i) and Scheme (ii), with log Z on the vertical axis.]
Fig. 1. Boxplots of estimators of log Z. Left to right in each panel: basic estimator, subsampled estimator (b = 10), regulated estimator ($\delta = 10^{-4}$), and amplified estimator under subsampling (a = 1.5).
2.2. Regulation and amplification
The performance of the estimator $\tilde Z$ or its subsampled version $\tilde Z_b$ depends on the properties of the Markov chain, such as how quickly the Markov chain mixes and how closely the stationary density $p_*(x)$ matches the integrand q(x). For Gibbs sampling, the stationary density is made perfectly proportional to the integrand, and the problem of speeding up the Gibbs sampler has been studied in the MCMC literature (e.g. Gilks et al., 1996; Liu, 2001). We address one additional factor that affects the performance of the estimator $\tilde Z$ or its subsampled version $\tilde Z_b$.
For example, let $\Gamma$ and X be the real line and q(x) be $\exp(-x^2/2)/\sqrt{2\pi}$. Consider two sampling schemes, where the sequence $(x_1, \ldots, x_n)$ converges to the standard normal distribution N(0, 1):

(i) $\xi_t \mid x_{t-1} \sim N(\rho x_{t-1},\ 1-\rho^2)$ and $x_t \mid \xi_t \sim N(\rho \xi_t,\ 1-\rho^2)$.
(ii) $\xi_t \sim N(0, 1)$ and $x_t \mid \xi_t \sim N(\rho \xi_t,\ 1-\rho^2)$.
Scheme (ii) with $\rho \approx 1$ is an extreme case of those situations where the Markov chain $[(\xi_1, x_1), \ldots, (\xi_n, x_n)]$ mixes well but the transition density $p(\cdot\,; \xi)$ is narrowly spread; see a generalized Gibbs sampling example in Section 3.2.
Under scheme (i), as $\rho$ increases to one, the Markov chain mixes more slowly. The estimator $\tilde Z$ has larger variance and more serious bias. Under scheme (ii), the Markov chain mixes perfectly and the estimator $\tilde Z$ is unbiased for any $0 \le \rho < 1$. But the variance of $\tilde Z$ depends on the value of $\rho$. If $\rho$ is close to one, the estimator $\tilde Z$ has a skewed distribution with a heavy right tail. Now consider the subsampled estimator $\tilde Z_b$. Under scheme (i), the subsampled sequence becomes approximately independent for a large subsampling interval b. The estimator $\tilde Z_b$ has reduced bias but increased variance, compared with the estimator $\tilde Z$. If $\rho$ is near one, the estimator $\tilde Z_b$ tends to yield large overestimates. Under scheme (ii), the estimator $\tilde Z_b$ has an even more skewed distribution than the estimator $\tilde Z$. These different performances are illustrated in Fig. 1, which is based on 5000 simulations of size 1000 and $\rho = .9$.
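The right-skewness under scheme (ii) with $\rho$ near one is easy to reproduce; below is a hypothetical script in the same spirit (we use fewer replications than the 5000 behind Fig. 1):

```python
# Reproduce the scheme (ii) behavior behind Fig. 1:
#   xi_t ~ N(0, 1),  x_t | xi_t ~ N(rho * xi_t, 1 - rho^2),
#   q(x) = N(0, 1) density, so log Z = 0.
import numpy as np

rng = np.random.default_rng(1)
rho, n, reps = 0.9, 1000, 200
s = np.sqrt(1.0 - rho ** 2)
q = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

log_Z = np.empty(reps)
for r in range(reps):
    xis = rng.normal(0.0, 1.0, n)
    xs = rng.normal(rho * xis, s)
    dens = np.exp(-0.5 * ((xs[:, None] - rho * xis[None, :]) / s) ** 2) \
        / (s * np.sqrt(2 * np.pi))            # dens[i, j] = p(x_i; xi_j)
    log_Z[r] = np.log(np.sum(q(xs) / dens.sum(axis=1)))
print(np.mean(log_Z), np.std(log_Z))          # right-skewed spread around 0
```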
The preceding discussion makes it clear that a poor performance may be caused by the narrow spread of the transition density $p(\cdot\,; \xi)$. We propose two modifications of the basic $\tilde\mu$ or the subsampled $\tilde\mu_b$. Only those of $\tilde\mu$ are presented, and those of $\tilde\mu_b$ should be understood in a similar manner. As seen from Fig. 1, these techniques are helpful for variance reduction by removing those extreme estimates.
First, the spread of the transition density $p(\cdot\,; \xi)$ is relevant because it affects how uniformly the average $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ converges to the stationary density $p_*(x)$ on a multiplicative scale for $x \in X$. Nonuniformity is most likely to occur on X where $p_*(x)$ is close to zero. We consider the regulated estimator
$$\tilde\mu_\delta(\{x\}) = \frac{P(\{x\})}{\delta \vee \big[n^{-1}\sum_{j=1}^n p(x; \xi_j)\big]},$$
by censoring $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ from below at $\delta \ge 0$. For a real-valued function q(x), the estimator $\int q(x)\,d\tilde\mu_\delta$ has asymptotic bias
$$E_{p_*}\bigg[\frac{q(x)}{\delta \vee p_*(x)}\bigg] - Z = E_{p_*}\bigg[q(x)\bigg(\frac{1}{\delta} - \frac{1}{p_*(x)}\bigg) 1_{\{p_*(x) < \delta\}}\bigg],$$
which is bounded in absolute value by $\int_{p_*(x) < \delta} |q(x)|\,d\mu_0$, the integral of $|q(x)|$ on the tail region $\{p_*(x) < \delta\}$.
The second modification hinges on the perspective that the Markov chain scheme provides a random design: the index $\xi_i$ indicates from which density, $p(\cdot\,; \xi_i)$, the draw $x_i$ is made for $1 \le i \le n$. Let $a(x) = a(x; \xi)$ be a transformation on X such that the transformed density $p_a(\cdot\,; \xi)$ is more widely spread. The point $a(x_i) = a(x_i; \xi_i)$ conditional on $\xi_i$ has distribution $p_a(\cdot\,; \xi_i)$ for $1 \le i \le n$. The scale transformation $a(x; \xi) = x + a[x - E(x; \xi)]$, where $E(\cdot\,; \xi)$ denotes the expectation under $p(\cdot\,; \xi)$, will be used in our examples (Section 3). We consider the amplified estimator
$$\tilde\mu_a(\{x\}) = \frac{P_a(\{x\})}{n^{-1}\sum_{j=1}^n p_a(x; \xi_j)},$$
where $P_a$ is the empirical distribution on the transformed points $a(x_1), \ldots, a(x_n)$. A disadvantage of this technique is that computation of $\tilde\mu_a$ does not build on that of $\tilde\mu$.
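Both modifications are small changes to the basic computation; below is a hedged sketch under the toy normal conventions used earlier. We read the scale transformation $a(x; \xi) = x + a[x - E(x; \xi)]$ as widening the spread of $p(\cdot\,; \xi) = N(E, \sigma^2)$ by the factor $1 + a$, so that $p_a(\cdot\,; \xi) = N(E, ((1+a)\sigma)^2)$; this reading, and the helper names, are our assumptions.

```python
# Sketches of the regulated estimator (censor the averaged design density
# below at delta) and the amplified estimator (widen the design densities).
import numpy as np

def regulated_Z(xs, xis, q, p_dens, delta):
    dens = p_dens(xs[:, None], xis[None, :])
    denom = np.maximum(delta, dens.mean(axis=1))  # delta v [n^{-1} sum_j p(x_i; xi_j)]
    return np.mean(q(xs) / denom)

def amplified_Z(xs, xis, q, cond_mean, sd, a):
    E = cond_mean(xis)                 # conditional means E(x; xi_j), e.g. rho * xi_j
    ax = xs + a * (xs - E)             # a(x_i; xi_i), using the point's own index
    sd_a = (1.0 + a) * sd              # spread of the transformed density p_a(. ; xi)
    dens_a = np.exp(-0.5 * ((ax[:, None] - E[None, :]) / sd_a) ** 2) \
        / (sd_a * np.sqrt(2 * np.pi))  # dens_a[i, j] = p_a(a(x_i); xi_j)
    return np.mean(q(ax) / dens_a.mean(axis=1))
```

In Fig. 1 the tuning values are $\delta = 10^{-4}$ and a = 1.5.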
2.3. Asymptotic variance
To motivate our variance estimators, suppose that the set $\Gamma$ is finite and $\xi_1, \ldots, \xi_n$ are independently generated from a common distribution with mass function $\pi(\xi)$. Then $x_1, \ldots, x_n$ are a simple random sample from $p_*(\cdot) = \sum_\xi \pi(\xi)\, p(\cdot\,; \xi)$. The estimator $\tilde Z$ is the usual stratified importance sampling estimator
$$\tilde Z = \frac{1}{n}\sum_{i=1}^n \frac{q(x_i)}{\sum_\xi \tilde\pi(\xi)\, p(x_i; \xi)},$$
where the observed mixture proportions $\tilde\pi(\xi) = \#\{i: \xi_i = \xi,\ 1 \le i \le n\}/n$ are used instead of the true mixture proportions $\pi(\xi)$.
Theorem 1. For a real-valued function q(x), assume that $\mathrm{var}_{p_*}[q(x)/p_*(x)] < \infty$. The estimator $\tilde Z$ is consistent and asymptotically normal with variance
$$n^{-1}\Bigg\{\mathrm{var}_{p_*}\bigg[\frac{q(x)}{p_*(x)}\bigg] - \sum_\xi \pi(\xi)\bigg(E_{p_*}\bigg[\frac{q(x)\,p(x; \xi)}{p_*^2(x)}\bigg] - Z\bigg)^2\Bigg\}.$$
If the integrand $q(x) \propto p_*(x)$, the variance becomes zero, which implies that the resulting estimator converges at a rate faster than $n^{-1/2}$.
It is interesting to compare the estimator $\tilde Z$ with the unstratified estimator
$$\frac{1}{n}\sum_{i=1}^n \frac{q(x_i)}{p_*(x_i)} = \frac{1}{n}\sum_{i=1}^n \frac{q(x_i)}{\sum_\xi \pi(\xi)\, p(x_i; \xi)},$$
where the true mixture proportions $\pi(\xi)$ are used. The asymptotic variance of the unstratified estimator is $n^{-1}\mathrm{var}_{p_*}[q(x)/p_*(x)]$. Therefore, the stratified estimator $\tilde Z$ has no greater asymptotic variance. More importantly, the estimator $\tilde Z$ requires no knowledge of the true mixture proportions in its formula, and hence is adaptive to the unknown distribution of design indices.
The asymptotic variance of $\tilde Z$ can be estimated by
$$n^{-1}\Bigg\{\int \bigg(\frac{q(x)}{\sum_{\xi'} \tilde\pi(\xi')\, p(x; \xi')} - \tilde Z\bigg)^2 dP - \sum_\xi \tilde\pi(\xi)\bigg(\int \frac{q(x)\, p(x; \xi)}{\big[\sum_{\xi'} \tilde\pi(\xi')\, p(x; \xi')\big]^2}\, dP - \tilde Z\bigg)^2\Bigg\}.$$
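A sketch of this plug-in variance estimate under the theorem's setting (finite index set $\Gamma$, independently drawn indices); the function and argument names are ours:

```python
# Plug-in estimate of Z tilde and its asymptotic variance for a finite
# index set; p_dens(x, g) evaluates p(x; g) and must broadcast over arrays.
import numpy as np

def Z_and_var(xs, xi_draws, gamma_levels, q, p_dens):
    n = len(xs)
    pi_hat = np.array([(xi_draws == g).mean() for g in gamma_levels])
    dens = p_dens(xs[:, None], np.asarray(gamma_levels)[None, :])  # n x |Gamma|
    mix = dens @ pi_hat               # sum_gamma pi_hat(gamma) * p(x_i; gamma)
    r = q(xs) / mix
    Z = r.mean()
    term1 = np.mean((r - Z) ** 2)     # integral of (q / mix - Z)^2 dP
    inner = np.mean(q(xs)[:, None] * dens / (mix ** 2)[:, None], axis=0)
    term2 = np.sum(pi_hat * (inner - Z) ** 2)
    return Z, (term1 - term2) / n
```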
Similar results can be obtained for the estimator $\tilde E(\phi)$.
Theorem 2. For a real-valued function q(x) whose integral Z is nonzero, assume that $\mathrm{var}_{p_*}[\phi(x)\,q(x)/p_*(x)] < \infty$. The estimator $\tilde E(\phi)$ is consistent and asymptotically normal with variance