Capturing uncertainty in source water quality

Erin Towler1, Balaji Rajagopalan*,1,2, R. Scott Summers1 and Chad Seidel3

1Department of Civil, Environmental and Architectural Engineering, University of Colorado, Boulder, Colorado, USA

2Co-operative Institute for Research in Environmental Sciences (CIRES), University of Colorado, Boulder, Colorado, USA

3Damon S. Williams Associates, LLC, Lakewood, Colorado, USA

Abstract:


Drinking water utilities must provide safe drinking water and consistently comply with a host of regulations. Compliance depends on influent water quality, treatment plant technology, and operational practices. Of these, influent water quality, especially of surface water, which constitutes a major source for most of the United States population, has significant variability due to climatological, geological, and water management factors. For utilities to manage their regulatory compliance efficiently, the variability in influent water quality needs to be well understood. Unfortunately, for many utilities such data do not exist, leaving them without information necessary to address compliance challenges. In such cases, utilities can benefit from existing national water quality databases tailored to their unique characteristics. To this end, we develop a K-nearest neighbor (K-nn) bootstrap technique to generate ensembles of influent water quality conditioned on a “feature vector” that can include latitude, longitude, and any other non-geographic attributes of interest. For a given feature vector, the simulation proceeds by identifying its K nearest neighbors from the historical data, which are weighted so that the nearest neighbor receives the highest weight and the farthest the least. One of the neighbors is then bootstrapped (i.e., resampled) using these weights. This is akin to estimating the conditional probability density function (PDF) from the K nearest neighbors and simulating from it. The main strengths of the approach are its simplicity of implementation and its ability to use a large amount of spatial data with limited or no temporal data to provide variability estimates in time, especially when using data from the Information Collection Rule (ICR). We demonstrate the approach and its flexibility by applying it to simulate monthly ensembles of total organic carbon (TOC) at a utility in Boulder, Colorado. We then demonstrate how the approach can be extended to other locations and other water quality parameters, specifically alkalinity and bromide. The simulations provide rich variability and capture the historical observations well; they are also evaluated in light of recently available data. Extensions to generate a suite of influent water quality variables simultaneously and to include additional variables in the feature vector are straightforward.

Introduction


As new drinking water regulations come into effect, drinking water utilities face complex decisions in balancing new and changing regulatory requirements with competing water quality objectives, all while meeting customer demands. This is a formidable task, considering that actions taken to comply with one regulation can be in direct conflict with other water quality goals. As such, the drinking water community has focused substantial attention on understanding how and why these conflicts arise. From these efforts has emerged the need for tools that help utilities understand treatment plant performance in light of changing regulations. Such tools must assess the treatment plant’s influent water quality, treatment processes, and management practices. Of these factors, influent water quality exhibits the greatest variability, can exert the most control over effluent water quality, and is difficult to control. This variability, especially in surface water, is due to natural sources such as climate, floods, droughts, and geologic formations, as well as to management practices. Surface water constitutes a major source for most consumers in the United States; hence, characterizing the influent variability is important for regulatory compliance. Consequently, compliance prediction tools that incorporate influent water quality should be probabilistic in nature, with robust quantification of the influent variability. Several existing databases contain national water quality information (e.g., USGS, USEPA, state agencies); however, most are limited in some way, such as in sample location geography, historical timeframe, or sampled parameters. The USEPA’s Information Collection Rule (ICR) database has represented the best drinking-water-relevant dataset, yet it leaves much to be desired in terms of understanding source water quality variability and predicting beyond the locations sampled. The effort described herein aims to address these needs and to offer additional tools to the drinking water community.

Traditional methods for modeling uncertainty involve fitting a probability density function (PDF) to the data and using it to simulate “scenarios” that capture the variability, i.e., the Monte Carlo approach. This approach has a very rich history, especially in hydrology and water resources management [Rajagopalan and Lall, 1999; Rajagopalan et al., 1997, and references therein]. The traditional approach has several drawbacks: (i) the choice of PDFs that can be fit to the data is limited, especially if the data exhibit a bimodal distribution; (ii) there is no choice other than the normal distribution for more than one variable; (iii) it is not portable across sites, i.e., a single PDF cannot be prescribed for all locations; and (iv) it is greatly influenced by outliers. Recent developments in nonparametric methods (see Lall, 1995, for an overview of these methods and their applications to hydrologic problems) alleviate these drawbacks to a large extent and offer an attractive alternative. Within this class, the K-nearest neighbor (K-nn) bootstrap technique [Lall and Sharma, 1996] and its variations have been developed and applied successfully to generate scenarios of daily weather [Yates et al., 2003; Buishand and Brandsma, 2001; Rajagopalan and Lall, 1999], streamflows for water management [Lall and Sharma, 1996; Prairie et al., 2005b; Gantz et al., 2005], and salinity [Prairie et al., 2005a].

The objective of this paper is to develop a simple, robust, and flexible framework to generate influent water quality values and their associated variability at locations that may have limited or no observed data. To this end, we adapt and develop the K-nn bootstrap technique to simulate influent water quality. The database from the United States Environmental Protection Agency’s required monitoring program, the Information Collection Rule (ICR), was used in this study [McGuire et al., 2002; USEPA, 2000]. The ICR database, developed to support recent rulemakings, has been assessed to better understand the uncertainty associated with influent water quality. Water quality parameters such as pH, temperature, alkalinity, hardness, turbidity, bromide, total organic carbon (TOC), and ultraviolet (UV) absorbance were included in the assessment. The ICR database has 18 monthly values covering July 1997 through December 1998. While this is temporally limiting for most traditional models, it is spatially robust, with observations at 500 treatment plants within 296 water systems across the US [Roberson, 2002]. A strength of the technique developed here is its ability to use the spatial information to capture the temporal variability.

We apply the K-nn bootstrap technique to several raw water quality parameters: TOC, alkalinity, and bromide. Predicting the TOC concentration is of particular interest because TOC is the main precursor for the formation of disinfection by-products (DBPs), regulated chemicals that affect most utilities; moreover, many small and medium-sized utilities do not have significant historical records of influent TOC. Finally, surface water TOC concentrations exhibit spatial and temporal variability that can be challenging to capture using conventional methods. The methodology is applied to two utilities with very different watersheds for TOC, and to alkalinity and bromide at two other utilities.

Methodology

Any attempt to generate scenarios is a conditional PDF simulation problem. For example, a utility might be interested in generating an ensemble of TOC concentration values capturing the variability for a given month conditioned on a suite of variables such as the annual average TOC concentration, latitude, longitude, etc. This can be represented by the generalized formula as

$f(x \mid \mathbf{d}) = \dfrac{f(x, \mathbf{d})}{f(\mathbf{d})}$        (1)

where $f(x, \mathbf{d})$ is the joint probability distribution and $f(\mathbf{d})$ is the marginal probability distribution of the conditioning variables. Here, $x$ is the monthly TOC concentration and $\mathbf{d}$ is the “feature vector” of conditioning variables, in this case the average annual TOC concentration, latitude, and longitude. For a given user-specified feature vector, the K nearest neighbors in the feature vector space are identified from the database. For example, if the feature vector consisted of only latitude and longitude, then the K nearest neighbors would be the physically closest locations. Distances between the user’s feature vector and all of the eligible entries in the ICR database are calculated using the Mahalanobis distance metric. This has an advantage over other distance metrics in that the components of the feature vector do not need to be scaled [Davis, 1986; Yates et al., 2003]. Different weights can be applied to each variable of the feature vector, allowing flexibility in the selection of the neighbors; this is demonstrated in the following section. The K nearest neighbors are then weighted so that the nearest neighbor (i.e., a location) is given the maximum weight and the farthest the least. With these weights, one of the nearest neighbors is selected at random, i.e., bootstrapped. The monthly TOC value from the selected site forms a simulated value for the user’s location. This is repeated a large number of times for each month, thus generating an ensemble of TOC values. This nearest neighbor resampling is akin to fitting the conditional PDF in equation (1) and simulating from it. The only parameter required is K, the size of the neighborhood. There are several methods for selecting K, but the heuristic rule $K = \sqrt{N}$, where N is the number of data points, has theoretical justification (Fukunaga, 1990; Lall and Sharma, 1996), seems to work well, and has been used in all the earlier applications in the aforementioned references. The Supporting Information provides the steps of the algorithm for implementation.
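To make the resampling concrete, the following Python sketch illustrates one way the K-nn bootstrap described above could be implemented. It is a minimal sketch, not the authors’ implementation: the array layouts (pool_features, pool_values), the folding of the user weights into the Mahalanobis distance by scaling the feature components, and the 1/i weight kernel for the neighbors are assumptions consistent with, but not dictated by, the text.

    import numpy as np

    def knn_bootstrap(feature, pool_features, pool_values, n_sims=500, weights=None):
        """K-nn bootstrap of monthly water quality values.

        feature       -- 1-D array, the user's feature vector (e.g., average
                         annual TOC, latitude, longitude)
        pool_features -- (N, p) feature vectors for the N candidate sites
        pool_values   -- (N, n_months) monthly values at those sites
        weights       -- optional per-component weights; zero drops a component
        """
        N, p = pool_features.shape
        w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)

        # Weighted Mahalanobis distance: scale each feature component by its
        # weight, then use the pseudo-inverse covariance of the scaled pool so
        # that the raw variables need no explicit rescaling.
        scaled = pool_features * w
        cov_inv = np.linalg.pinv(np.cov(scaled, rowvar=False))
        diff = scaled - feature * w
        q = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
        dist = np.sqrt(np.maximum(q, 0.0))  # guard tiny negatives from pinv

        # Heuristic neighborhood size K = sqrt(N) (Lall and Sharma, 1996).
        K = int(round(np.sqrt(N)))
        neighbors = np.argsort(dist)[:K]

        # Weight kernel: nearest neighbor gets the largest weight, the
        # farthest the least (here, weight proportional to 1/rank).
        ranks = np.arange(1, K + 1)
        probs = (1.0 / ranks) / np.sum(1.0 / ranks)

        # Bootstrap one neighbor per month per simulation and take that
        # site's value for the month, giving an (n_sims, n_months) ensemble.
        n_months = pool_values.shape[1]
        picks = np.random.choice(neighbors, size=(n_sims, n_months), p=probs)
        return pool_values[picks, np.arange(n_months)]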

Results

Results of applying the model to four utilities are shown in Figures 2 to 7. Simulation of monthly influent TOC concentrations for the City of Boulder’s Betasso Water Treatment Plant was considered first. The influent water quality at this plant is impacted by snowmelt during spring runoff. The K-nn technique was used to generate 500 simulations of monthly influent TOC values. The statistics of the simulations are compared with those of the observed data at this location to evaluate the performance of the technique. Two simulation strategies are considered to show the versatility and flexibility of the method: (i) the data from the Boulder utility are dropped from the potential pool of neighbors, and (ii) the data from the Boulder utility are included in the pool. The first strategy represents a scenario in which a utility is considering a new plant that uses a source water without a water quality data history. Furthermore, to demonstrate the flexibility of the method in neighborhood selection, the first simulation strategy is performed under two conditions: in one, only the average annual TOC is considered in the feature vector, by setting the weights of the latitude and longitude to zero in the distance calculation (Figure 1a); in the other, all three variables in the feature vector are given equal weights (Figure 1b). The range of the neighborhood is quite different in the two cases, illustrating the impact that the weights can have on neighborhood selection (see the sketch after Figure 1). The weights in the neighbor selection can be optimized [Young, 1994; Yakowitz and Karlsson, 1987] or prescribed by the user as relevant to the situation. For instance, if the variability of the water quality variable is similar longitudinally, then the weight on the latitude can be reduced or eliminated, thus constraining the neighborhood to the longitudinal direction.

Figure 1 Nearest neighbors (black dots) for the Boulder utility being simulated (black outlined triangle) where weights for latitude and longitude are equal to zero (a) and all weights are equal to one (b).
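Using the hypothetical knn_bootstrap helper sketched in the Methodology section, the two weighting conditions could be expressed as follows; the Boulder feature values shown are illustrative placeholders, not data from the ICR.

    # Hypothetical pool arrays, assumed extracted from the ICR database:
    # pool_features -- (N, 3) array of [annual average TOC, latitude, longitude]
    # pool_values   -- (N, 18) array of monthly TOC values (Jul 1997 - Dec 1998)
    boulder = np.array([3.0, 40.0, -105.3])   # illustrative feature vector only

    # Condition 1: neighborhood based only on the average annual TOC
    # (latitude and longitude weights set to zero, as in Figure 1a).
    sims_toc_only = knn_bootstrap(boulder, pool_features, pool_values,
                                  weights=[1.0, 0.0, 0.0])

    # Condition 2: all three feature-vector components weighted equally
    # (as in Figure 1b).
    sims_equal = knn_bootstrap(boulder, pool_features, pool_values,
                               weights=[1.0, 1.0, 1.0])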

The simulations from the first strategy, in which the data from the Boulder utility are dropped from the pool, under the two conditions described above, are shown in Figures 2 and 3. The simulations for each month and the simulated annual average (Ann) are shown as boxplots, in which the box represents the 25th and 75th percentiles, the whiskers show the 5th and 95th percentiles, points are values outside this range, and the horizontal line represents the mean. The box plots show the range of uncertainty, with a wider box indicating larger uncertainty. The triangles connected by dotted or solid lines correspond to the observed ICR data: 6 months in 1997 and 12 months in 1998. Also shown in the figures, as squares, are the utility’s three-year average (3YA) monthly and annual values from 2003 to 2005. If an observed value falls within the box (between the 25th and 75th percentiles), the simulations capture the historical properties well.

As shown in Figure 2, when the simulations do not include the latitude and longitude in the neighbor selection, the seasonality present in the TOC data and the 3YA annual average are not well simulated. Only 2 of the 18 monthly ICR values fall into the inner quartile range of the simulations. However, the 1998 annual average value is well captured by the simulations, even though the data at Boulder are not used. Simulations using all three of the conditioning variables, as shown in Figure 3, better capture the seasonality. Here, the simulations still overestimate the TOC values from the ICR: only 7 of the 18 monthly ICR values fall into the inner quartile range. However, seasonality is better captured, and the three-year average monthly and annual values are much better represented than when the latitude and longitude are not utilized (Figure 2). The simulations are quite good considering that the data from the Boulder location are excluded.

Figure 2 Box plots of simulated monthly and annual TOC concentrations for the Boulder utility, where the neighborhood is based only on the average annual TOC. The 1997 and 1998 ICR data from the Boulder utility are not included in the pool of data for bootstrapping. 3YA represents the three-year average of monthly values for 2003-2005.

Figure 3 Box plots of simulated monthly and annual TOC concentrations for the Boulder utility, where the neighborhood is based equally on all three variables of the feature vector. The 1997 and 1998 ICR data from the Boulder utility are not included in the pool of data for bootstrapping. 3YA represents the three-year average of monthly values for 2003-2005.

We also performed a simulation in which the neighborhood was selected only on the basis of latitude and longitude. The neighborhood in this case (see Supporting Information, Figure 8) was similar to that obtained using equal weights for all the variables (Figure 1b); consequently, the simulations (Supporting Information, Figure 9) were similar to Figure 3. However, this may not be the case for other utilities.

Simulations from the second strategy, in which the ICR data from Boulder were included in the pool and the weights for all three variables in the feature vector were equal, are shown in Figure 4. As expected, the 1997 and 1998 ICR observed values fall within the inner quartile range for 14 of the 18 months, and the seasonality is very well captured. The inner quartile range increased in the winter months (October to March) compared to the first strategy (Figure 3) and decreased in the other months, when the spring runoff impacts the water quality. In addition, monthly and annual average observed values from 2003, 2004, and 2005 are overlaid. The monthly variability for the different years is well captured by the simulations. To quantify the comparison of values predicted by the model to the observed monthly values from 2003, 2004, and 2005, the frequency of the observed monthly values within the predicted inner quartiles (box), the 75th to 95th and 25th to 5th percentile ranges (whiskers), and the outliers (>95th and <5th) is plotted in Figure 8. The expected outcome with a large data set would be that 50% of the observed data fall into the inner quartiles (box), 40% into the whiskers, and 10% into the outliers. For the simulations without the Boulder TOC data (Figure 3), almost 20% of the observed values qualified as outliers (>95th and <5th), while when the Boulder TOC data were included in the simulations (Figure 4), the frequency of observed data in the outlier group was reduced to that expected (about 10%) and the frequency of observed data in the inner quartile increased to higher than expected (83%). Thus, including the Boulder ICR data in the model development improved the predictive nature of the model. For this case, the improvement was a result of the model’s ability to better capture the seasonal differences.
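The coverage check described above reduces to counting how often the observed monthly values fall within each simulated percentile band. A minimal sketch, continuing the hypothetical arrays from the Methodology section:

    def coverage_frequencies(sims, observed):
        """Fraction of observed monthly values falling in the simulated
        inner quartile (25th-75th), whisker (5th-25th and 75th-95th),
        and outlier (<5th or >95th) ranges.

        sims     -- (n_sims, n_months) simulated values
        observed -- (n_months,) observed monthly values
        """
        q5, q25, q75, q95 = np.percentile(sims, [5, 25, 75, 95], axis=0)
        inner = np.mean((observed >= q25) & (observed <= q75))
        whisker = np.mean(((observed >= q5) & (observed < q25)) |
                          ((observed > q75) & (observed <= q95)))
        outlier = np.mean((observed < q5) | (observed > q95))
        return inner, whisker, outlier

With a large sample, the expected frequencies are roughly 0.50 (box), 0.40 (whiskers), and 0.10 (outliers), matching the benchmark used above.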