Gangsheng Wang and Shulin Chen

ORMAT Evaluation of soil greenhouse gas emission models using Bayesian inference and MCMC: I. Model uncertainty

Department of Biological Systems Engineering, Washington State University, Pullman, WA 99164-6120, USA

Corresponding Author: Shulin Chen, PhD, PE, Professor,

Bioprocessing and Bioproduct Engineering Laboratory

Department of Biological Systems Engineering

Washington State University

Pullman, WA 99164

509-335-3743 (phone)

509-335-2722 (fax)

ABSTRACT

The combination of the Bayesian inference and the Markov Chain Monte Carlo (MCMC) technique provides an effective approach for quantifying the uncertainties in the process-based soil greenhouse gas (GHG) emission models. In this paper, the most important procedure, the Metropolis-Hastings sampling, was investigated in details by comparing four univariate proposal distributions (symmetric/asymmetric uniform, and symmetric/asymmetric normal) and one multivariate normal proposal distribution. As a result of the case study on the SoilGHG model, almost all the posterior parameter ranges from the multivariate distribution shrank into an order of magnitude, even for those parameters with wide posterior ranges using the univariate distributions. The narrow standard deviation (below 10% of the mean value) meant small simulation errors in the SoilGHG model. The simulation errors in CO2 fluxes were much larger than those in N2O fluxes, which resulted in a greater importance in model structure than in model parameters for CO2 simulations, and this was opposite for N2O simulations. Significant implications from this study include: (i) deriving the covariance matrix of parameters for the multivariate normal proposal distribution based on the sampling results from the univariate distributions; and (ii) updating a single parameter rather than to update the entire parameter vector each time. The discussions on the uncertainties from different sources indicated that the model parametric and structural uncertainties could be controlled within a reasonable range using the SoilGHG model, and the measurement errors in gas fluxes and the uncertainties in model initialization were also important for model applications.

Key words: Bayesian inference; greenhouse gas (GHG); Markov Chain Monte Carlo (MCMC); model uncertainty

1. Introduction

[1] Process-based mathematical models have been developed to simulate the greenhouse gas (GHG) emissions as an important part of the carbon-nitrogen dynamics in soils (Chen et al., 2008; Ma and Shaffer, 2001; Smith et al., 1997; Wu and McGechan, 1998). However, the studies on the uncertainties in these models and model applications are limited (Wang and Chen, 2010). A subjective interpretation of uncertainty is “the degree of confidence that a decision maker has about possible outcomes and/or probabilities of these outcomes” (Refsgaard et al., 2007). Uncertainty assessment is therefore important when models are used for decision-making. Uncertainty analysis will not only gives the uncertainty from different sources (i.e., model parameters, model structure, model inputs and outputs), but also gives an evaluation of model performance and limitations.

[2] Some studies on GHG models are currently concentrated on sensitivity analysis, which is one of the methods mentioned by Refsgaard et al.(2007). For example, the uncertainty of the PnET-N-DNDC model was evaluated by examining the sensitivity of the model outputs to such environmental factors as temperature, precipitation, photosynthetically active radiation (PAR) and some model input variables, e.g., N-concentration in precipitation, litter mass, soil organic carbon (SOC), pH, and soil texture (Stange et al., 2000). Herein the sensitivity analysis was conducted by changing one factor at a time while keeping all others constant.

[3] Monte Carlo is the most widely used method in uncertainty anaylsis. Thorsen et al. (2001) used the Monte Carlo method to analyze the propagation of uncertainty from input data to model output, and found that the magnitude of uncertainty is closely associated with the investigated temporal and spatial scale, i.e., smaller output uncertainty on catchment scale than on grid level. The Monte Carlo technique was also applied to assess the uncertainty of DenNit model output with regard to parameterization (Reth et al., 2005). A Gaussian distribution within ± one standard error was used for parameter sampling. In the comparison of carbon and nitrogen dynamics under conditions of conventional and diversified rotations, 64 parameter combinations were identified to test their impacts on model outcomes (Tonitto et al., 2007). An uncertainty analysis tool for the DNDC model allows for the selection of either Monte Carlo or Most Sensitive Factor (MSF) method (Li, 2007b). The MSF method involves running the model twice for each spatial unit (e.g., grid cell or polygon) with the maximum and minimum parameter values. These two runs generate two gas fluxes to form an interval which theoretically covers the real gas flux with a high probability (Li, 2007a). Using this uncertainty model, several highly sensitive factors influencing DNDC model were identified (Li et al., 1992a; Li et al., 1992b).

[4] The above-mentioned methods usually did not take into account the probability distribution of parameters. Even if a distribution was considered, e.g., a Gaussian distribution in Reth et al. (2005), it was a priori. The posterior distribution of a parameter is more informative than a priori for modelers to use and evaluate a data-driven model. The Bayesian inference is exactly the theory which is capable of deriving the posterior distribution of parameters based on the priori and the observed inputs and outputs (Congdon, 2001; Lee and Kim, 2008).

[5] The objective of this paper is to assess the model uncertainties due to model parameters and model structure by coupling the Bayesian theory with the Markov Chain and Monte Carlo (MCMC) method. The uncertainties from these two sources are also compared with those due to model input data and observation errors in the outputs. A soil GHG emission model is also introduced to test the proposed uncertainty analysis method. These uncertainty analysis algorithms can also be integrated into the existing models.

2. Methods

2.1. Model uncertainty analysis method

2.1.1. Bayesian inference

[6] The Bayesian inference (Blasone et al., 2008; Congdon, 2001; Engeland et al., 2005; Lee and Kim, 2008; Valstar and Minnema, 2002; Vrugt et al., 2008a) is a useful tool for the uncertainty analysis of model parameters and structure. Based on Bayes’ theorem, the posterior distribution of both model parameters and model variance from the prior distribution (e.g., the prior distribution can be simply assumed as uniform distribution) can be derived, and the 95% confidence intervals of any output variables due to the parameter uncertainty and the model structure uncertainty can be estimated. Additionally, the uncertainty analysis can narrow the ranges for parameters and also give an estimation of parameter values and correlations between parameters.

[7] According to Bayesian inference (Congdon, 2001; Yang et al., 2007), the posterior distribution π(Θ|yt), i.e., the likelihood function L(Θ |yt), of parameter set Θ (model parameters θ and some statistical parameters γ) can be generated from the prior distribution f(Θ) conditioned on observations yt:

where ( θ, σy) is a vector including the model parameter set (θ)and the variance (σy) depicting simulation errors; f() is the prior distribution function; is the distribution function of model output variable conditioned on ; is the likelihood function, i.e., the posterior distribution function of conditioned on observations , where t is a time index. Generrally, is a transformation of the model output , e.g., , which yields a homoscedastic varaince for the simulation errors (Engeland et al., 2005; Kuehl, 1999).

[8] As for the prior distribution, a common approach is to assume uniform priors (Iskrev, 2007), which means is a constant. In addition, the model output () is assumed to follow a normal distribution:

Thus

[9] When all the simulation errors are assumed to be independent, the likelihood of the model outcome can be expressed as the product of the likelihood of each individual outcome at each time step.

2.1.2. Metropolis-Hastings Algorithm for MCMC

[10] The Metropolis-Hastings (MH) algorithm (Kuczera and Parent, 1998; Owen and Tribble, 2005; Sandor and Andras, 2004; Twarakavi et al., 2008; Vrugt et al., 2008a; Vrugt et al., 2008b) is a typical Markov Chain Monte Carlo (MCMC) sampling method to randomly sample from the posterior distribution described by Eq. . The procedure of MH can be found in many reports (Chumbley et al., 2007; Engeland et al., 2005; Hastings, 1970; Link and Barker, 2008; Mathe and Novak, 2007; Tiana et al., 2007). An important criterion in MH is the acceptance probability:

where xk is the current state of the chain; x* is the new state of the chain generated using a specified irreducible proposal distribution ; π(·) is the posterior distribution function obtained from Eq. .

[11] If we draw a random number (Z) from the uniform distribution , then

[12] If the proposal distribution is symmetric, i.e., , then the acceptance probability can be simplified as

[13] The usual proposal distributions are introduced as follows. The ratio of the proposal distributions between the current state and the new state is also derived for the asymmetric distributions (see Appendix for details).

(i) Symmetric uniform distribution

(ii) Asymmetric uniform distribution

, if ,

or , if 10)

11)

(iii) Symmetric normal distribution

12)

(iv) Asymmetric normal distribution

13)

14)

(v) Symmetric multivariate normal distribution

15)

where is the covariance matrix of the parameters, d is the number of dimensions (parameters).

[14] In the transition from the current state to a new state, three strategies may be used (Hastings, 1970): (i) All elements in the parameter vector are changed. In this case, and are no longer the one-dimensional parameter as the previous four distributions, they are vectors containing d parameters. (ii) Only one of the elements is randomly selected and changed. (iii) Only one element is changed, and this element is selected in a fixed rather than a random sequence.

[15] It is noted that the implementation of MCMC is difficult if a parameter has a space ranging several orders of magnitude. For example, a parameter ranges from 10−7 to 10−1, and the current state of the chain is 10−4, it is not easy to generate a new state with a value has a magnitude of 10−6 or 10−2. In this case, we suggest conducting MCMC pertaining to the logarithm of the parameter, i.e., the parameter space is transformed from (10−7, 10−1) to (−7, −1). Hereafter the logarithm-transformed parameter space is called the logarithm parameter space (or range).

2.1.3. Generating random vectors from multivariate normal distribution

[16] The probability density function of the general form of the multivariate normal distribution is

16)

where d is the dimension of ; is the determinant of ; and is the vector of means, which is substituted by the current state in MCMC.

[17] It is not as simple as the one-dimensional distributions to implement the multivariate normal distribution. It is required to know the covariance matrix in advance, and to randomly generate a vector, not a single parameter value, from the distribution. Although the true covariance matrix is unknown, it can be approximately estimated from the parameter samples generated by MCMC using any of the four univariant distributions. The algorithm described in Hernadvolgyi (1998) is followed to generate random vectors from the multivariate normal distribution. In summary, the procedures to apply the multivariate normal distribution in the MH algorithm include:

(i) Generate parameter samples by MCMC using any of the four univariate proposal distributions.

(ii) Calculate the covariance matrix Ʃ from the parameter samples, and compute the eigenvalues and the corresponding eigenvectors. Let Φ be the eigenvector matrix whose columns are the normalized eigenvectors and let Λ be the diagonal matrix whose diagonal entries are the eigenvalues in the order corresponding to the columns of Φ. Calculate

17)

(iii) The average median values of the parameters from the four univariate distributions are used as the initial values when the multivariate distribution is adopted. In addition, the original prior parameter space is used for the multivariate distribution, which still ensures that the parameters are sampled from a wide space.

(iv) Generate a vector whose elements all follow the standard normal distribution N(0,1).

(v) Use a linear transformation to get

18)

where is the vector from the multivariate normal distribution .

(vi) As mentioned before, the entire vector or a single element in the vector may be updated in each state transition.

2.1.4. Uncertainty assessment due to model parameters and structure

[18] The 95% confidence intervals for any interested output due to parameter uncertainty can be computed by the output samples from a certain number of parameter sets generated by the MH algorithm. The 95% confidence intervals for both the parameter uncertainty and the model structure uncertainty were calculated by adding the model residuals to each of the output values at each time step, where the residuals had a mean of zero and a variance (σy).

2.2. A Test Model for Uncertainty Analysis

[19] In this section, a test model, the soil greenhouse gas emissions model (SoilGHG), is developed. The major transformation processes in SoilGHG include decomposition, mineralization/immobilization, nitrification, and denitrification. In addition, gas emissions is described by a diffusion process governed by Fick’s law (Campbell and Norman, 1998; Muller, 1999). The model scheme and major components are shown in Fig. 1.

2.2.1. Decomposition

[20] Like the carbon-nitrogen model in Porporato et al. (2003), three functional pools are considered in SoilGHG, i.e., added organic matter (AOM, i.e., litter, fertilizer, or manure), humus, and microbial biomass. The carbon and nitrogen balance equations for the three pools can be expressed as (Manzoni et al., 2004; Porporato et al., 2003):