Journal of Statistical Planning and Inference 138 (2008) 1967–1980
Monte Carlo integration with Markov chain
Zhiqiang Tan
Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, 615 North Wolfe Street, Baltimore, MD 21205, USA
Received 20 April 2006; received in revised form 25 April 2007; accepted 18 July 2007
Available online 15 August 2007
Abstract
There are two conceptually distinct tasks in Markov chain Monte Carlo (MCMC): a sampler is designed for simulating a Markov chain and then an estimator is constructed on the Markov chain for computing integrals and expectations. In this article, we aim to address the second task by extending the likelihood approach of Kong et al. for Monte Carlo integration. We consider a general Markov chain scheme and use partial likelihood for estimation. Basically, the Markov chain scheme is treated as a random design and a stratified estimator is defined for the baseline measure. Further, we propose useful techniques including subsampling, regulation, and amplification for achieving overall computational efficiency. Finally, we introduce approximate variance estimators for the point estimators. The method can yield substantially improved accuracy compared with Chib's estimator and the crude Monte Carlo estimator, as illustrated with three examples.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Gibbs sampling; Importance sampling; Markov chain Monte Carlo; Partial likelihood; Stratification; Variance estimation
1. Introduction
Markov chain Monte Carlo (MCMC) has been extensively used in statistics and other scientific fields. A key idea is to simulate a Markov chain rather than a simple random sample for Monte Carlo integration. There are two conceptually distinct tasks: a sampler is designed for simulating a Markov chain converging to a target distribution, and then an estimator is constructed on the Markov chain for computing integrals and expectations. The first task has been actively researched, for example in finding effective sampling algorithms and diagnosing convergence in various MCMC applications (e.g. Gilks et al., 1996; Liu, 2001). In this article, we shall be concerned with the second task: making efficient inference given a Markov chain.
Suppose that q(x) is a nonnegative function on a state space X and its integral
$$Z = \int q(x)\,d\mu_0$$
is analytically intractable with respect to a baseline measure $\mu_0$. An MCMC algorithm can be applied to simulate a Markov chain $(x_1,\ldots,x_n)$ converging to the probability distribution with density
$$p(x) = \frac{q(x)}{Z},$$
doi:10.1016/j.jspi.2007.07.013
without requiring the value of its normalizing constant Z. Then a common approach is that, if n is sufficiently large, these points are used as an approximate and dependent sample from the distribution p(x) for Monte Carlo integration. For example, the expectation $E_p(\phi)$ of a function $\phi(x)$ with respect to p(x) can be estimated by the sample average or the crude Monte Carlo estimator
$$\bar{E}(\phi) = \frac{1}{n}\sum_{i=1}^n \phi(x_i).$$
However, there are two challenging problems for this inferential approach.
First, the Monte Carlo variability of $\bar{E}(\phi)$ is generally underestimated by the sample variance
$$\frac{1}{n}\sum_{i=1}^n [\phi(x_i) - \bar{E}(\phi)]^2$$
divided by n, and the amount of underestimation can sometimes be substantial due to serial dependency. Specialized methods, such as running multiple chains or using batch means, are needed to assess the Monte Carlo error (e.g. Gelman and Rubin, 1992; Geyer, 1992).
Second, the crude Monte Carlo estimator is not directly applicable to the normalizing constant Z, which may be of interest and part of the computational goal. For example, if q(x) is the product of a prior and a likelihood in Bayesian inference, Z is the predictive density of the data. As a result, various methods have been proposed for computing normalizing constants, but most of them involve a matching density or additional simulation (e.g. DiCiccio et al., 1997; Gelman and Meng, 1998).
We consider a general Markov chain scheme and develop a new method for estimating simultaneously the normalizing constant Z and the expectation $E_p(\phi)$. Suppose that a Markov chain $[(\xi_1, x_1),\ldots,(\xi_n, x_n)]$ is generated as follows.
General Markov chain scheme: at each time t,
update $\xi_t$ given $(\xi_{t-1}, x_{t-1})$ and
sample $x_t$ from $p(\cdot\,; \xi_t)$.
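For concreteness, the scheme can be sketched in code. Below is a minimal Python sketch (ours, not the paper's; the paper's computations use MATLAB), instantiated with the bivariate normal example that appears later in Section 2.2; the function name simulate_chain and the choice $\rho = 0.9$ are illustrative assumptions.

```python
# Minimal sketch of the general Markov chain scheme, instantiated with
# scheme (i) of Section 2.2:
#   update  xi_t | x_{t-1} ~ N(rho * x_{t-1}, 1 - rho^2),
#   sample  x_t  | xi_t    ~ N(rho * xi_t,    1 - rho^2) = p(. ; xi_t).
# Only p(. ; xi) needs to be known in closed form for estimation; how xi_t
# is updated matters for simulation but not for the estimators below.
import numpy as np

def simulate_chain(n, rho=0.9, seed=0):
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho ** 2)
    xis, xs = np.empty(n), np.empty(n)
    x_prev = 0.0
    for t in range(n):
        xis[t] = rng.normal(rho * x_prev, s)  # update xi_t given (xi_{t-1}, x_{t-1})
        xs[t] = rng.normal(rho * xis[t], s)   # sample x_t from p(. ; xi_t)
        x_prev = xs[t]
    return xis, xs
```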
It is helpful to think of $\xi_t$ as an index and $x_t$ as a draw given the index. Denote by $\Gamma$ the index set. The details of updating $\xi_t$ are not needed for estimation, even though these details are essential to simulation. In contrast, the transition density $p(\cdot\,; \xi)$ is assumed to be completely known on X with respect to $\mu_0$ for $\xi \in \Gamma$. This scheme is very flexible and encompasses the following individual cases:
(i) a Markov chain $(x_1, x_2, \ldots)$ with transition density $p(\cdot\,; x_{t-1})$, by letting $\xi_t = x_{t-1}$.
(ii) a subsampled chain $(x_1, x_{b+1}, x_{2b+1}, \ldots)$, by letting $(\xi_t, x_t)$ be $(x_{tb}, x_{tb+1})$ in the original Markov chain.
(iii) Gibbs sampling, where the chain $(\xi_t, x_t)$ has a joint stationary distribution $p(\xi, x)$, and $\xi_t$ is sampled from the conditional distribution $p(\xi; x_{t-1})$ and $x_t$ from the conditional distribution $p(x; \xi_t)$.
(iv) generalized Gibbs sampling, where a stochastic transformation $g_t$ of $\xi_t$ is inserted after $\xi_t$ is sampled and then $x_t$ is sampled from $p(\cdot\,; g_t(\xi_t))$; the index $\xi_t$ is expanded to be $(\xi_t, g_t)$.
(v) Metropolis–Hastings sampling, by identifying the current state as $\xi_t$ and the proposal as $x_t$, even though the commonly referred Metropolis–Hastings chain, that is $(\xi_1, \ldots, \xi_n)$, does not admit a transition density.
Case (i) is conceptually important but may be restrictive in practice, as we shall explain. In case (ii), the subsampled chain $(x_1, x_{b+1}, x_{2b+1}, \ldots)$ is Markovian, but its transition density is complicated, being the b-step convolution of the original transition density. In case (iii), the joint chain is Markovian on $\Gamma \times X$ with transition density $p(x; \xi)\,p(\xi; x_{t-1})$, but integrals of interest are defined on X. The marginal chain $(x_1, \ldots, x_n)$ is also Markovian, but its transition density involves integrating $\xi$ out of $p(x; \xi)\,p(\xi; x_{t-1})$, which is typically difficult.
A familiar example of case (iii) is data augmentation in Bayesian computation, where x is a parameter, $\xi$ is a latent variable, and q(x) is the product of a prior and a likelihood (Gelfand and Smith, 1990; Tanner and Wong, 1987). Case (iv) is represented by a recent development in data augmentation, where $\xi$ includes not only a latent variable but also a working parameter (Liu and Sabatti, 2000; Liu and Wu, 1999; van Dyk and Meng, 2001).
The general description of the scheme allows the situation where $x_t$ is sampled component by component, such as
sample $x_t^1$ from $p^1(x^1; \xi_t)$ and
sample $x_t^2$ from $p^{2|1}(x^2; x_t^1, \xi_t)$.
Then $p(x; \xi_t)$ is the product $p^1(x^1; \xi_t)\, p^{2|1}(x^2; x^1, \xi_t)$. As a result, case (iii) or (iv) is not restricted to two-component Gibbs sampling as it appears to be.
Assume that the sequence $(x_1, \ldots, x_n)$ converges to a probability distribution as $n \to \infty$. Denote by $p_*(x)$ the stationary density with respect to $\mu_0$. It is not necessary that $p_*(x)$ be identical to the target density p(x). The sequence $(x_1, \ldots, x_n)$ typically converges to $p_*(x) = p(x)$ in Gibbs sampling and its generalization [cases (iii) and (iv)]. In contrast, this sequence does not converge to p(x), i.e. $p_*(x) \neq p(x)$, in Metropolis–Hastings sampling, where $(\xi_1, \ldots, \xi_n)$ converges to p(x) [case (v)]. We develop a general method in Section 2 and illustrate the application to Gibbs sampling and its variation in Section 3. Applications of the general method to rejection sampling and Metropolis–Hastings sampling are presented in Tan (2006).
We take the likelihood approach of Kong et al. (2003) in developing the method; see Tan (2004) for the optimality of the likelihood approach in two situations of independence sampling. In that approach, the baseline measure is treated as the parameter in a model and estimated as a discrete measure by maximum likelihood. Consequently, integrals of interest are estimated as finite sums by substituting the estimated measure. Kong et al. (2003) considered independence sampling and the simplest, individual case (i) of Markov chain sampling. We extend the likelihood approach to the general Markov chain scheme by using partial likelihood (Cox, 1975). The approximate maximum likelihood estimator of the baseline measure is
$$\tilde\mu(\{x\}) = \frac{P(\{x\})}{n^{-1}\sum_{j=1}^n p(x; \xi_j)},$$
where P is the empirical distribution placing mass $n^{-1}$ at each of the points $x_1, \ldots, x_n$. The integral $Z = \int q(x)\,d\mu_0$ is estimated by
$$\tilde Z = \int q(x)\,d\tilde\mu = \sum_{i=1}^n \frac{q(x_i)}{\sum_{j=1}^n p(x_i; \xi_j)}.$$
Note that the same estimator also holds for a real-valued integrand q(x). The expectation $E_p(\phi) = \int \phi(x) q(x)\,d\mu_0 \big/ \int q(x)\,d\mu_0$ is estimated by
$$\tilde E(\phi) = \frac{\int \phi(x) q(x)\,d\tilde\mu}{\int q(x)\,d\tilde\mu} = \sum_{i=1}^n \frac{\phi(x_i)\, q(x_i)}{\sum_{j=1}^n p(x_i; \xi_j)} \bigg/ \sum_{i=1}^n \frac{q(x_i)}{\sum_{j=1}^n p(x_i; \xi_j)}.$$
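To fix ideas, here is a sketch (our illustrative Python; basic_estimators, p_dens, and the N(0, 1) toy target are assumed names and choices, not the paper's) that computes $\tilde Z$ and $\tilde E(\phi)$ from a simulated chain via an $n \times n$ matrix of transition-density evaluations.

```python
# Sketch of the basic estimators.  q and phi are vectorized functions;
# p_dens(x, xi) evaluates the known transition density p(x; xi) and must
# broadcast over numpy arrays.
import numpy as np

def basic_estimators(xs, xis, q, phi, p_dens):
    dens = p_dens(xs[:, None], xis[None, :])  # dens[i, j] = p(x_i; xi_j), n^2 terms
    denom = dens.sum(axis=1)                  # sum_j p(x_i; xi_j)
    w = q(xs) / denom                         # mass that q d(mu tilde) puts on x_i
    Z_tilde = w.sum()
    E_tilde = np.sum(phi(xs) * w) / Z_tilde   # weighted Monte Carlo average
    return Z_tilde, E_tilde

# Toy check: target q(x) = N(0,1) density, so Z = 1 and E_p(x) = 0.
rho = 0.9
s = np.sqrt(1.0 - rho ** 2)
rng = np.random.default_rng(0)
xis = rng.normal(0.0, 1.0, 1000)              # scheme (ii) of Section 2.2
xs = rng.normal(rho * xis, s)
q = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
p_dens = lambda x, xi: np.exp(-0.5 * ((x - rho * xi) / s) ** 2) / (s * np.sqrt(2 * np.pi))
Z_tilde, E_tilde = basic_estimators(xs, xis, q, lambda x: x, p_dens)
```

The weights w here are proportional to $q(x_i)/[n^{-1}\sum_j p(x_i; \xi_j)]$, matching the weighted Monte Carlo interpretation given in Section 2.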
Further, we modify the basic estimator $\tilde\mu$ and propose a subsampled estimator $\tilde\mu_b$, a regulated estimator $\tilde\mu_\delta$, an amplified estimator $\tilde\mu_a$, and their combined versions. Finally, we introduce approximate variance estimators for the point estimators.
Our method can require less total computational time (for simulating the Markov chain and computing the estimators) than statistically inefficient estimation with a brute-force increase of sample size to achieve the same degree of accuracy. In three examples, we find that the basic estimator $\log \tilde Z$ has smaller variance than Chib's estimator by a factor ⩾ 100, and the basic estimator $\tilde E(\phi)$ has smaller variance than the crude Monte Carlo estimator by a factor of 1–100. Subsampling helps to reduce computational cost, while regulation or amplification helps to reinforce statistical efficiency, so that our method achieves overall computational efficiency. We also find that the approximate variance estimators agree with the empirical variances of the corresponding point estimators in all our examples.
2. Markov chain simulation
For the simplest case (i), i.e. $\xi_t = x_{t-1}$, Kong et al.'s (2003) model assumes that $x_t$ conditional on $x_{t-1}$ has distribution
$$\frac{p(\cdot\,; x_{t-1})\,d\mu}{\int p(x; x_{t-1})\,d\mu},$$
where $\mu$ is a nonnegative measure on X. The likelihood is
$$\prod_{i=1}^n \left[ \frac{p(x_i; x_{i-1})\,\mu(\{x_i\})}{\int p(x; x_{i-1})\,d\mu} \right].$$
For the general Markov chain scheme, even though the sequence $[(\xi_1, x_1), \ldots, (\xi_n, x_n)]$ is a Markov chain, we consider the model only specifying that $x_t$ conditional on $\xi_t$ has distribution
$$\frac{p(\cdot\,; \xi_t)\,d\mu}{\int p(x; \xi_t)\,d\mu},$$
where $\mu$ is a nonnegative measure on X. The partial likelihood (Cox, 1975) is
$$\prod_{i=1}^n \left[ \frac{p(x_i; \xi_i)\,\mu(\{x_i\})}{\int p(x; \xi_i)\,d\mu} \right].$$
If $\xi_t$ is deterministic given $x_{t-1}$, for example $\xi_t = x_{t-1}$, this likelihood coincides with the full likelihood.
Under similar conditions of support and connectivity as in Vardi (1985), the maximum likelihood estimate $\hat\mu$ has finite support $\{x_1, \ldots, x_n\}$ and satisfies
$$\hat\mu(\{x\}) = \frac{P(\{x\})}{n^{-1}\sum_{j=1}^n Z^{-1}(\xi_j)\, p(x; \xi_j)},$$
where $Z(\xi_j) = \int p(x; \xi_j)\,d\hat\mu$. This equation involves n unknowns, $\hat\mu(\{x_1\}), \ldots, \hat\mu(\{x_n\})$, and practically defies numerical solution for large n. However, the difficulty can be overcome by substituting the true value $Z(\xi_j) = \int p(x; \xi_j)\,d\mu_0$, known to be one, in the above likelihood equation. The resulting estimator is given by $\tilde\mu$ in Section 1.
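To spell out where the likelihood equation comes from (a short derivation we add for completeness; the mass-point parametrization $\mu_k = \mu(\{x_k\})$ is ours), differentiate the log partial likelihood with $\mu$ supported on the sampled points:
$$\ell(\mu) = \sum_{i=1}^n \log p(x_i; \xi_i) + \sum_{i=1}^n \log \mu_i - \sum_{i=1}^n \log\Big(\sum_{l=1}^n \mu_l\, p(x_l; \xi_i)\Big),$$
$$\frac{\partial \ell}{\partial \mu_k} = \frac{1}{\mu_k} - \sum_{j=1}^n \frac{p(x_k; \xi_j)}{Z(\xi_j)} = 0 \quad\Longrightarrow\quad \mu_k = \bigg(\sum_{j=1}^n Z^{-1}(\xi_j)\, p(x_k; \xi_j)\bigg)^{-1},$$
and substituting $Z(\xi_j) \equiv 1$ gives $\tilde\mu(\{x_k\}) = 1\big/\sum_{j=1}^n p(x_k; \xi_j)$, the estimator of Section 1.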
In retrospect, the Markov chain scheme basically provides a random design: an index $\xi_i$ is stochastically selected and then a draw $x_i$ is made from $p(\cdot\,; \xi_i)$ for $1 \le i \le n$. The estimator $\tilde Z$ is a stratified importance sampling estimator using one observation $x_i$ per distribution $p(\cdot\,; \xi_i)$. The estimator $\tilde E(\phi)$ is a weighted Monte Carlo estimator: the observations $x_i$ have weights proportional to
$$\frac{q(x_i)}{n^{-1}\sum_{j=1}^n p(x_i; \xi_j)}.$$
These insights are important in themselves, independent of the model formulation, and are relevant to most of the development here.
Now consider Gibbs sampling [case (iii)], where the sequence $(x_1, \ldots, x_n)$ converges to the target distribution p(x). By the detailed balance equation, it follows that the average of successive transition densities $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ converges to p(x) pointwise. This result suggests that the estimator $\tilde Z$ converges faster than at the standard rate $n^{-1/2}$, because $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ serves as the stratified importance sampling density and is asymptotically proportional to the integrand q(x). This super efficiency for estimating the normalizing constant was observed in Kong et al. (2003) and in our simulation studies (e.g. Table 1).
Moreover, the convergence result implies that the ratio
$$\frac{q(x)}{n^{-1}\sum_{j=1}^n p(x; \xi_j)}$$
converges to the normalizing constant Z at any fixed $x \in X$. For Gibbs sampling, Chib (1995) proposed evaluating the above ratio at a high-density point to estimate Z. Therefore, the estimator $\tilde Z$ appears to be an average of Chib's ratios
Table 1
Comparison of estimators of log Z

              n = 500            n = 1000           n = 2500           n = 5000
              Chib      Lik      Chib      Lik      Chib      Lik      Chib      Lik
Time ratio    <.01      2.2      <.01      3.5      <.01      7.8      <.01      15.8
Bias          −.0042    −.013    −.0024    −.0063   −.00093   −.0025   −.00082   −.0012
Std Dev       .113      .0108    .0807     .00459   .0513     .00167   .0363     .000803
Sqrt MSE      .113      .0167    .0807     .00777   .0513     .00299   .0363     .00148
Approx Err    NA        .0124    NA        .00607   NA        .00244   NA        .00125

Note: Time ratio is the ratio of CPU seconds for integral evaluation against Gibbs sampling. Bias, Std Dev, and Sqrt MSE are the bias, standard deviation, and square root of the mean squared error of the point estimates, and Approx Err is the square root of the mean of the variance estimates. The results are based on MATLAB programming.
at the sampled points $x_i$. This relationship holds whether $x_t$ is sampled in one block or component by component. In the latter case, $p(x; \xi_t)$ is given by the product of componentwise transition densities (see Section 1).
For generalized Gibbs sampling [case (iv)], we conjecture that the average of densities $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ still converges to p(x) pointwise. A formal proof is not easily available, due to the fact that $[(\xi_1, x_1), \ldots, (\xi_n, x_n)]$ may be a non-positive-recurrent Markov chain even though $(x_1, \ldots, x_n)$ is a positive recurrent subchain (van Dyk and Meng, 2001). In this case, Chib's estimator remains to be formally studied, but it is empirically investigated and compared with the estimator $\tilde Z$ in our simulations (Section 3.2.2).
For Metropolis–Hastings sampling, the average of densities $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ does not generally converge to p(x), because $(\xi_1, \ldots, \xi_n)$ converges to p(x) [case (v)]. In this case, Chib's estimator is not valid, but the estimator $\tilde Z$ remains appropriate (see Tan, 2006).
2.1. Subsampling
Implementation of our method centers on computation of $\tilde\mu$, which involves $n^2$ evaluations of $p(x_i; \xi_j)$ for $1 \le i, j \le n$. This part is computationally demanding for large n, even though each individual $p(x_i; \xi_j)$ is easy to evaluate. Therefore, we propose a subsampled estimator to reduce the computational cost.
Specifically, divide the Markov chain into m blocks, each of length b (n = bm). For $1 \le i \le b$, the subsampled sequence $[(\xi_i, x_i), (\xi_{b+i}, x_{b+i}), \ldots, (\xi_{(m-1)b+i}, x_{(m-1)b+i})]$ fits the setup in Section 1: $\xi_{b+i}$ is stochastically mapped from $(\xi_i, x_i)$ and $x_{b+i}$ is sampled from $p(\cdot\,; \xi_{b+i})$, and so on. Applying $\tilde\mu$ to this sequence, we obtain
$$\tilde\mu_{\mathrm{row}\ i}(\{x\}) = \frac{P_{\mathrm{row}\ i}(\{x\})}{m^{-1}\sum_{j=1}^m p(x; \xi_{(j-1)b+i})},$$
where $P_{\mathrm{row}\ i}$ is the empirical distribution on the ith subsampled sequence. To use the whole sequence, we take the average
$$\tilde\mu_b(\{x\})\Big|_{x=x_k} = \frac{1}{b}\sum_{i=1}^b \tilde\mu_{\mathrm{row}\ i}(\{x\})\Big|_{x=x_k} = \frac{P(\{x\})}{m^{-1}\sum_{j \equiv k\ (\mathrm{mod}\ b)} p(x; \xi_j)},$$
where the sum in the denominator is taken over all $1 \le j \le n$ congruent with k modulo b if $x = x_k$. Computation of $\tilde\mu_b$ involves $bm^2$ evaluations of transition densities, only a fraction 1/b of those for computation of $\tilde\mu$. The normalizing constant Z and the expectation $E_p(\phi)$ are then estimated by substituting $\tilde\mu_b$ in their definitions.
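In code (continuing our illustrative Python conventions from Section 1), the subsampled estimator only ever sums transition densities over indices congruent with k modulo b; a minimal sketch:

```python
# Sketch of the subsampled estimator of Z.  With 0-based indexing,
#   mu_b({x_k}) = (1/n) / [ m^{-1} * sum_{j = k (mod b)} p(x_k; xi_j) ]
#               = 1 / ( b * sum_{j = k (mod b)} p(x_k; xi_j) ).
import numpy as np

def subsampled_Z(xs, xis, q, p_dens, b):
    n = len(xs)
    assert n % b == 0, "n = b * m, with m blocks of length b"
    total = 0.0
    for k in range(n):
        js = np.arange(k % b, n, b)            # the m indices congruent with k mod b
        denom = p_dens(xs[k], xis[js]).sum()   # only m density evaluations per point
        total += q(xs[k]) / (b * denom)
    return total
```

With b = 1 this reduces to $\tilde Z$, and the total cost is $nm = bm^2$ density evaluations, as stated above.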
MacEachern et al. (2003) suggested the estimator $\tilde\mu_{\mathrm{row}\ i}$ from a subsample and compared it with $\tilde\mu$ from a full sample, both of length m. With b = 10, they reported a 10–43% reduction in standard deviation for estimating the normalizing constant in Kong et al.'s (2003) probit regression example. In contrast, we find that subsampling is more helpful for reducing bias than variance, and aggressive subsampling may inflate the variance; see Sections 2.2 and 3.
[Two boxplot panels, titled Scheme (i) and Scheme (ii), with log Z on the vertical axis.]
Fig. 1. Boxplots of estimators of log Z. Left to right in each panel: basic estimator, subsampled estimator (b = 10), regulated estimator ($\delta = 10^{-4}$), and amplified estimator under subsampling (a = 1.5).
2.2. Regulation and amplification
The performance of the estimator $\tilde Z$ or its subsampled version $\tilde Z_b$ depends on the properties of the Markov chain, such as how quickly the Markov chain mixes and how closely the stationary density $p_*(x)$ matches the integrand q(x). For Gibbs sampling, the stationary density is made perfectly proportional to the integrand, and the problem of speeding up the Gibbs sampler has been studied in the MCMC literature (e.g. Gilks et al., 1996; Liu, 2001). We address one additional factor that affects the performance of the estimator $\tilde Z$ or its subsampled version $\tilde Z_b$.
For example, let $\Gamma$ and X be the real line and q(x) be $\exp(-x^2/2)/\sqrt{2\pi}$. Consider two sampling schemes, where the sequence $(x_1, \ldots, x_n)$ converges to the standard normal distribution N(0, 1):

(i) $\xi_t \mid x_{t-1} \sim N(\rho x_{t-1},\ 1-\rho^2)$ and $x_t \mid \xi_t \sim N(\rho \xi_t,\ 1-\rho^2)$.
(ii) $\xi_t \sim N(0, 1)$ and $x_t \mid \xi_t \sim N(\rho \xi_t,\ 1-\rho^2)$.
Scheme (ii) with $\rho \approx 1$ is an extreme case of those situations where the Markov chain $[(\xi_1, x_1), \ldots, (\xi_n, x_n)]$ mixes well but the transition density $p(\cdot\,; \xi)$ is narrowly spread; see a generalized Gibbs sampling example in Section 3.2.
Under scheme (i), as $\rho$ increases to one, the Markov chain mixes more slowly. The estimator $\tilde Z$ has larger variance and more serious bias. Under scheme (ii), the Markov chain mixes perfectly and the estimator $\tilde Z$ is unbiased for any $0 \le \rho < 1$. But the variance of $\tilde Z$ depends on the value of $\rho$. If $\rho$ is close to one, the estimator $\tilde Z$ has a skewed distribution with a heavy right tail. Now consider the subsampled estimator $\tilde Z_b$. Under scheme (i), the subsampled sequence becomes approximately independent for a large subsampling interval b. The estimator $\tilde Z_b$ has reduced bias but increased variance, compared with the estimator $\tilde Z$. If $\rho$ is near one, the estimator $\tilde Z_b$ tends to yield large overestimates. Under scheme (ii), the estimator $\tilde Z_b$ has an even more skewed distribution than the estimator $\tilde Z$. These different performances are illustrated in Fig. 1, which is based on 5000 simulations of size 1000 and $\rho = .9$.
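The right-skewness under scheme (ii) with $\rho$ near one is easy to reproduce; below is a hypothetical script in the same spirit (we use fewer replications than the 5000 behind Fig. 1):

```python
# Reproduce the scheme (ii) behavior behind Fig. 1:
#   xi_t ~ N(0, 1),  x_t | xi_t ~ N(rho * xi_t, 1 - rho^2),
#   q(x) = N(0, 1) density, so log Z = 0.
import numpy as np

rng = np.random.default_rng(1)
rho, n, reps = 0.9, 1000, 200
s = np.sqrt(1.0 - rho ** 2)
q = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

log_Z = np.empty(reps)
for r in range(reps):
    xis = rng.normal(0.0, 1.0, n)
    xs = rng.normal(rho * xis, s)
    dens = np.exp(-0.5 * ((xs[:, None] - rho * xis[None, :]) / s) ** 2) \
        / (s * np.sqrt(2 * np.pi))            # dens[i, j] = p(x_i; xi_j)
    log_Z[r] = np.log(np.sum(q(xs) / dens.sum(axis=1)))
print(np.mean(log_Z), np.std(log_Z))          # right-skewed spread around 0
```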
The preceding discussion makes it clear that a poor performance may be caused by the narrow spread of the transition density $p(\cdot\,; \xi)$. We propose two modifications of the basic $\tilde\mu$ or the subsampled $\tilde\mu_b$. Only those of $\tilde\mu$ are presented, and those of $\tilde\mu_b$ should be understood in a similar manner. As seen from Fig. 1, these techniques are helpful for variance reduction by removing those extreme estimates.
First, the spread of the transition density $p(\cdot\,; \xi)$ is relevant because it affects how uniformly the average $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ converges to the stationary density $p_*(x)$ on a multiplicative scale for $x \in X$. Nonuniformity is most likely to occur on X where $p_*(x)$ is close to zero. We consider the regulated estimator
$$\tilde\mu_\delta(\{x\}) = \frac{P(\{x\})}{\delta \vee \big[n^{-1}\sum_{j=1}^n p(x; \xi_j)\big]},$$
by censoring $n^{-1}\sum_{j=1}^n p(x; \xi_j)$ from below at $\delta \ge 0$. For a real-valued function q(x), the estimator $\int q(x)\,d\tilde\mu_\delta$ has asymptotic bias
$$E_{p_*}\bigg[\frac{q(x)}{\delta \vee p_*(x)}\bigg] - Z = E_{p_*}\bigg[q(x)\bigg(\frac{1}{\delta} - \frac{1}{p_*(x)}\bigg) 1_{\{p_*(x) < \delta\}}\bigg],$$
which is bounded in absolute value by $\int_{p_*(x) < \delta} |q(x)|\,d\mu_0$, the integral of $|q(x)|$ on the tail region $\{p_*(x) < \delta\}$.
The second modification hinges on the perspective that the Markov chain scheme provides a random design: the index $\xi_i$ indicates from which density, $p(\cdot\,; \xi_i)$, the draw $x_i$ is made for $1 \le i \le n$. Let $a(x) = a(x; \xi)$ be a transformation on X such that the transformed density $p_a(\cdot\,; \xi)$ is more widely spread. The point $a(x_i) = a(x_i; \xi_i)$ conditional on $\xi_i$ has distribution $p_a(\cdot\,; \xi_i)$ for $1 \le i \le n$. The scale transformation $a(x; \xi) = x + a[x - E(x; \xi)]$, where $E(\cdot\,; \xi)$ denotes the expectation under $p(\cdot\,; \xi)$, will be used in our examples (Section 3). We consider the amplified estimator
$$\tilde\mu_a(\{x\}) = \frac{P_a(\{x\})}{n^{-1}\sum_{j=1}^n p_a(x; \xi_j)},$$
where $P_a$ is the empirical distribution on the transformed points $a(x_1), \ldots, a(x_n)$. A disadvantage of this technique is that computation of $\tilde\mu_a$ does not build on that of $\tilde\mu$.
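Both modifications are small changes to the basic computation; below is a hedged sketch under the toy normal conventions used earlier. We read the scale transformation $a(x; \xi) = x + a[x - E(x; \xi)]$ as widening the spread of $p(\cdot\,; \xi) = N(E, \sigma^2)$ by the factor $1 + a$, so that $p_a(\cdot\,; \xi) = N(E, ((1+a)\sigma)^2)$; this reading, and the helper names, are our assumptions.

```python
# Sketches of the regulated estimator (censor the averaged design density
# below at delta) and the amplified estimator (widen the design densities).
import numpy as np

def regulated_Z(xs, xis, q, p_dens, delta):
    dens = p_dens(xs[:, None], xis[None, :])
    denom = np.maximum(delta, dens.mean(axis=1))  # delta v [n^{-1} sum_j p(x_i; xi_j)]
    return np.mean(q(xs) / denom)

def amplified_Z(xs, xis, q, cond_mean, sd, a):
    E = cond_mean(xis)                 # conditional means E(x; xi_j), e.g. rho * xi_j
    ax = xs + a * (xs - E)             # a(x_i; xi_i), using the point's own index
    sd_a = (1.0 + a) * sd              # spread of the transformed density p_a(. ; xi)
    dens_a = np.exp(-0.5 * ((ax[:, None] - E[None, :]) / sd_a) ** 2) \
        / (sd_a * np.sqrt(2 * np.pi))  # dens_a[i, j] = p_a(a(x_i); xi_j)
    return np.mean(q(ax) / dens_a.mean(axis=1))
```

In Fig. 1 the tuning values are $\delta = 10^{-4}$ and a = 1.5.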
2.3. Asymptotic variance
To motivate our variance estimators, suppose that the set $\Gamma$ is finite and $\xi_1, \ldots, \xi_n$ are independently generated from a common distribution with mass function $\pi(\xi)$. Then $x_1, \ldots, x_n$ are a simple random sample from $p_*(\cdot) = \sum_\xi \pi(\xi)\, p(\cdot\,; \xi)$. The estimator $\tilde Z$ is the usual stratified importance sampling estimator
$$\tilde Z = \frac{1}{n}\sum_{i=1}^n \frac{q(x_i)}{\sum_\xi \tilde\pi(\xi)\, p(x_i; \xi)},$$
where the observed mixture proportions $\tilde\pi(\xi) = \#\{i: \xi_i = \xi,\ 1 \le i \le n\}/n$ are used instead of the true mixture proportions $\pi(\xi)$.
Theorem 1. For a real-valued function q(x), assume that $\mathrm{var}_{p_*}[q(x)/p_*(x)] < \infty$. The estimator $\tilde Z$ is consistent and asymptotically normal with variance
$$n^{-1}\Bigg\{\mathrm{var}_{p_*}\bigg[\frac{q(x)}{p_*(x)}\bigg] - \sum_\xi \pi(\xi)\bigg(E_{p_*}\bigg[\frac{q(x)\,p(x; \xi)}{p_*^2(x)}\bigg] - Z\bigg)^2\Bigg\}.$$
If the integrand $q(x) \propto p_*(x)$, the variance becomes zero, which implies that the resulting estimator converges at a rate faster than $n^{-1/2}$.
It is interesting to compare the estimator $\tilde Z$ with the unstratified estimator
$$\frac{1}{n}\sum_{i=1}^n \frac{q(x_i)}{p_*(x_i)} = \frac{1}{n}\sum_{i=1}^n \frac{q(x_i)}{\sum_\xi \pi(\xi)\, p(x_i; \xi)},$$
where the true mixture proportions $\pi(\xi)$ are used. The asymptotic variance of the unstratified estimator is $n^{-1}\mathrm{var}_{p_*}[q(x)/p_*(x)]$. Therefore, the stratified estimator $\tilde Z$ has no greater asymptotic variance. More importantly, the estimator $\tilde Z$ requires no knowledge of the true mixture proportions in its formula, and hence is adaptive to the unknown distribution of design indices.
The asymptotic variance of $\tilde Z$ can be estimated by
$$n^{-1}\Bigg\{\int \bigg(\frac{q(x)}{\sum_{\xi'} \tilde\pi(\xi')\, p(x; \xi')} - \tilde Z\bigg)^2 dP - \sum_\xi \tilde\pi(\xi)\bigg(\int \frac{q(x)\, p(x; \xi)}{\big[\sum_{\xi'} \tilde\pi(\xi')\, p(x; \xi')\big]^2}\, dP - \tilde Z\bigg)^2\Bigg\}.$$
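A sketch of this plug-in variance estimate under the theorem's setting (finite index set $\Gamma$, independently drawn indices); the function and argument names are ours:

```python
# Plug-in estimate of Z tilde and its asymptotic variance for a finite
# index set; p_dens(x, g) evaluates p(x; g) and must broadcast over arrays.
import numpy as np

def Z_and_var(xs, xi_draws, gamma_levels, q, p_dens):
    n = len(xs)
    pi_hat = np.array([(xi_draws == g).mean() for g in gamma_levels])
    dens = p_dens(xs[:, None], np.asarray(gamma_levels)[None, :])  # n x |Gamma|
    mix = dens @ pi_hat               # sum_gamma pi_hat(gamma) * p(x_i; gamma)
    r = q(xs) / mix
    Z = r.mean()
    term1 = np.mean((r - Z) ** 2)     # integral of (q / mix - Z)^2 dP
    inner = np.mean(q(xs)[:, None] * dens / (mix ** 2)[:, None], axis=0)
    term2 = np.sum(pi_hat * (inner - Z) ** 2)
    return Z, (term1 - term2) / n
```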
Similar results can be obtained for the estimator $\tilde E(\phi)$.
Theorem 2. For a real-valued function q(x) whose integral Z is nonzero, assume that $\mathrm{var}_{p_*}[\phi(x)\,q(x)/p_*(x)] < \infty$. The estimator $\tilde E(\phi)$ is consistent and asymptotically normal with variance