Supplementary Materials s40

1. GA-PLS modeling

1.1 Genetic algorithms

Our GA model includes five components: encoding, population initialization, individual selection, crossover, and mutation. Spectral variables (S: from in situ or AISA image spectra; see Figure S2) are encoded as binary chromosomes, i.e., strings of zeros and ones. The fitness of every chromosome is evaluated using a predefined fitness function to determine whether it satisfies the constraints with regard to the water quality parameters (Y). If it does, the chromosome is output as the selected result; if not, the chromosomes with better fitness are selected to “survive,” and offspring are generated from them through crossover and mutation (analogous to recombining bands). The fitness of every chromosome is then evaluated again, and this step is repeated until the fitness satisfies the predefined constraints (Ding et al., 1998; Leardi, 2000). Details on the GA used in our model can be found in Li et al. (2007) and Song et al. (2012).
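The five components listed above can be illustrated with a minimal sketch. This is not the implementation used in the study: the function name `run_ga`, its parameters (population size, mutation rate, single-point crossover, keeping the fitter half as parents) and the toy fitness function are all assumptions chosen for illustration.

```python
import numpy as np

def run_ga(fitness, n_vars, pop_size=30, n_gen=50, p_mut=0.02, target=None, seed=0):
    """Evolve binary chromosomes; a 1 at position j means variable j is selected."""
    rng = np.random.default_rng(seed)
    # Encoding and population initialization: random strings of zeros and ones.
    pop = rng.integers(0, 2, size=(pop_size, n_vars))
    for _ in range(n_gen):
        scores = np.array([fitness(c) for c in pop])
        pop = pop[np.argsort(scores)[::-1]]          # best chromosome first
        if target is not None and scores.max() >= target:
            break                                    # predefined constraint satisfied
        # Selection: the fitter half "survives" and acts as parents.
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            cut = rng.integers(1, n_vars)            # single-point crossover
            child = np.concatenate([parents[i][:cut], parents[j][cut:]])
            flip = rng.random(n_vars) < p_mut        # mutation: flip a few bits
            children.append(np.where(flip, 1 - child, child))
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(c) for c in pop])
    best = int(np.argmax(scores))
    return pop[best], float(scores[best])

# Toy fitness (hypothetical): agreement with a known set of informative variables.
informative = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
toy_fitness = lambda c: int(np.sum(c == informative))

best_chromosome, best_score = run_ga(toy_fitness, n_vars=10, target=10)
print(best_chromosome, best_score)
```

In a GA-PLS setting the toy fitness would be replaced by a cross-validated measure of how well a PLS model built on the selected variables predicts Y.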

1.2 Partial least squares (PLS)

A simple PLS model consists of two outer relations and one inner relation (Figure S2). Let X [n × m] represent the explanatory matrix (spectral variables). The first outer relation is derived by applying principal component analysis (PCA) to X, resulting in the score matrix T [n × a] and the loading matrix P' [a × m] plus an error matrix E [n × m], that is, X = TP' + E. The second outer relation decomposes the response matrix Y (water quality parameters) in the same way and leaves an error matrix F. A detailed explanation of the PLS structure and of the inner and outer relations between X and Y can be found in Li et al. (2007) and Song et al. (2012). The goal of the PLS model is to minimize the norm of F while maximizing the covariance between X and Y through the inner relation. The selection of the optimal number of PLS components is a key step in obtaining a model with good predictive capability. Leave-one-out cross-validation was applied in this study; details on its implementation can be found in Leardi (2000) and Li et al. (2007).
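As an illustration of the outer and inner relations and of choosing the number of components a by leave-one-out cross-validation, the sketch below uses the standard single-response NIPALS algorithm (PLS1). It is a stand-in written for this note, not the code used in the study; the function names and the synthetic data are assumptions.

```python
import numpy as np

def pls1_fit(X, y, a):
    """Fit PLS with a components for a single response y (NIPALS, PLS1)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean                # residuals of the outer relations
    W, P, q = [], [], []
    for _ in range(a):
        w = E.T @ f
        w /= np.linalg.norm(w)                   # weight: direction of maximum covariance with y
        t = E @ w                                # score vector (one column of T)
        p = E.T @ t / (t @ t)                    # loading vector (one row of P')
        c = f @ t / (t @ t)                      # inner-relation coefficient
        E, f = E - np.outer(t, p), f - c * t     # deflate X and y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)       # regression vector in the space of X
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return y_mean + (X - x_mean) @ coef

def loo_rmse(X, y, a):
    """Leave-one-out cross-validated RMSE for a PLS model with a components."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = pls1_fit(X[keep], y[keep], a)
        errors[i] = y[i] - pls1_predict(model, X[i])
    return float(np.sqrt(np.mean(errors ** 2)))

# Synthetic data (hypothetical): n = 30 samples, m = 6 spectral variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
y = X[:, 0] + 2.0 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=30)

rmsecv = [loo_rmse(X, y, a) for a in range(1, 7)]
best_a = int(np.argmin(rmsecv)) + 1              # number of components with the lowest error
print(best_a, np.round(rmsecv, 3))
```

With all m components retained, the PLS1 prediction coincides with ordinary least squares, which gives a convenient sanity check on the deflation steps.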

1.3 GA-PLS implementation

Correlation analysis with narrow-band spectral reflectance, derivative spectra, and all possible band ratios (Malthus and Dekker, 1995) was conducted to preliminarily select sensitive spectral variables (100 in total: 50 band ratios, 20 narrow-band spectra, and 30 derivatives). These spectral variables were then further processed by the GA to select the most sensitive latent variables for PLS modeling of the various water quality parameters (e.g., Chl-a, PC, TSM and SDD). To avoid overfitting, the program had the following features: (1) the parameters were set with the highest elitism using the method of Leardi (2000); (2) the model was determined after 100 independent, short GA runs; and (3) the frequency of selection of the variables in each run was set to a weighted average of the frequency of selection in the starting run and in the previous run. The fitness function used to evaluate the individuals was the percentage of predicted variance of a constituent abundance, defined as:

(1)

where y_i and ŷ_i are the measured and predicted water quality values, respectively, n is the number of samples considered, and k = n − 1 in the case of cross-validation. The performance of GA-PLS was evaluated with the root mean square error (RMSE) obtained through cross-validation, which is written as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2}} \qquad (2)$$

where N is the total number of samples and the other variables are the same as in Equation (1). In our study, RMSE was also applied for model performance validation. All samples in this study were divided into calibration (70%) and validation (30%) subsets for testing model performance. Spectra from the in situ measurements and from the AISA image were calibrated and validated separately, and the model trained on the AISA spectra was further applied to map water quality parameters using GA-PLS.
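The workflow of this section (band-ratio construction, correlation-based preselection, a 70%/30% calibration/validation split, and RMSE on the validation subset) can be sketched as follows. Everything here is illustrative: the data are synthetic, the function names are hypothetical, only five variables are preselected instead of 100, and ordinary least squares stands in for the GA-PLS model.

```python
import numpy as np

def band_ratios(R):
    """All ordered band ratios R[:, i] / R[:, j] with i != j."""
    b = R.shape[1]
    i, j = np.where(~np.eye(b, dtype=bool))
    return R[:, i] / R[:, j], list(zip(i.tolist(), j.tolist()))

def top_correlated(V, y, k):
    """Indices of the k columns of V with the largest |Pearson r| against y."""
    Vc, yc = V - V.mean(axis=0), y - y.mean()
    r = (Vc.T @ yc) / (np.linalg.norm(Vc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(-np.abs(r))[:k]
    return order, r[order]

def split_70_30(n, seed=0):
    """Random split of n sample indices into calibration (70%) and validation (30%)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_cal = int(round(0.7 * n))
    return idx[:n_cal], idx[n_cal:]

def rmse(y, y_hat):
    """Root mean square error between measured and predicted values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Synthetic reflectance (hypothetical): 100 samples, 8 bands; y follows band 3 / band 1.
rng = np.random.default_rng(1)
R = rng.uniform(0.02, 0.10, size=(100, 8))
y = 5.0 * R[:, 3] / R[:, 1] + 0.05 * rng.normal(size=100)

ratios, pairs = band_ratios(R)
cal, val = split_70_30(len(y))

# Preselect on the calibration subset only, so the validation samples stay unseen.
selected, r_values = top_correlated(ratios[cal], y[cal], k=5)

# Ordinary least squares as a stand-in for the GA-PLS model.
A_cal = np.column_stack([np.ones(len(cal)), ratios[cal][:, selected]])
beta = np.linalg.lstsq(A_cal, y[cal], rcond=None)[0]
A_val = np.column_stack([np.ones(len(val)), ratios[val][:, selected]])
rmse_val = rmse(y[val], A_val @ beta)

print(pairs[selected[0]], round(rmse_val, 3))
```

Preselecting on the calibration subset alone is a design choice made in this sketch to keep the validation RMSE an honest estimate; the text does not state which samples were used for the correlation analysis.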

2. Supplementary Figures

Figure S1. Comparison between the ASD in situ measured and the calibrated AISA image spectra for sampling station 5.

Figure S2. Flowchart for GA-PLS. The left diagram shows how the genetic algorithm selects spectral variables, and the right diagram shows how partial least squares is used to derive water quality variables (e.g., TN and TP) from the GA-selected spectral variables.

Figure S3. Normalized frequency distribution of water quality parameters for the aggregated datasets: (a) total nitrogen (TN) concentration and (b) total phosphorus (TP) concentration.

3. Supplementary references

Ding Q, Small GW, Arnold MA (1998) Genetic algorithms-based wavelength selection for the near infrared determination of glucose in biological matrixes: Initialization strategies and effects of spectral resolution. Anal Chem. 70(21): 4472–4479.

Leardi R (2000) Application of genetic algorithm–PLS for feature selection in spectral data sets. J. Chemometrics 14: 643–655.

Li L, Ustin SL, Riano D (2007) Retrieval of fresh leaf fuel moisture content using Genetic Algorithm–Partial Least Squares modeling (GA-PLS). IEEE T. Geosci Remote S. 4(2): 216–220.

Malthus TJ, Dekker AG (1995) First derivative indices for the remote sensing of inland water quality using high spectral resolution reflectance. Environ Int. 21(2): 221–232.

Song KS, Li L, Tedesco LP, Li S, Clercin AN, Hall B, Li ZC, Shi K (2012) Hyperspectral determination of eutrophication for a water supply source via genetic algorithm-partial least squares (GA–PLS) modeling. Sci Total Environ. 426: 220–232.