Supplemental Material: Model building process
We performed NYCCAS data analysis and built land use regression (LUR)-based models to describe the observed spatial distribution of each pollutant of interest, following the methods outlined below. Model-building was performed separately for each pollutant.
STEP 1: Review & summarize all GIS-based covariates and concentration data
First, we examined distributions in each pollutant of interest, using the data collected across the six two-week sampling sessions in each season. This examination was structured as follows:
1.1. Examine distributions of all concentrations.
Identify outliers (for distributed sites, these are values more than 3 standard deviations from the mean). Review field logs and data quality flags. If no explanation for an outlier is found, retain it for further analysis.
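The outlier screen in 1.1 can be sketched as follows; this is an illustrative Python sketch with made-up concentrations, not the NYCCAS code:

```python
# Illustrative sketch of the step 1.1 outlier rule (not the NYCCAS code):
# flag distributed-site values more than 3 standard deviations from the mean.
from statistics import mean, stdev

def flag_outliers(concentrations):
    """Return indices of values more than 3 SD from the mean."""
    m = mean(concentrations)
    s = stdev(concentrations)
    return [i for i, c in enumerate(concentrations)
            if abs(c - m) > 3 * s]

# Example: one extreme two-week mean among otherwise similar sites.
values = [10.0] * 19 + [100.0]
print(flag_outliers(values))  # -> [19]
```

Flagged values are then cross-checked against field logs before deciding whether to retain them.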
1.2. Plot and compare two-week means separately, across the six seasonal sessions, to assess temporal variability:
- 5 NYCCAS reference sites:
The trends for each of the five sites are examined individually and on average.
- 150 NYCCAS distributed sites (including both purposeful and systematic sites): Average and standard deviations are plotted across the six sessions.
- DEC ambient monitoring sites:
Average and standard deviations are plotted across the six sessions.
Compare mean trends to ensure that we observe:
(1) consistent temporal trends across the three sets of measurements,
(2) larger (spatial) variance in NYCCAS distributed sites.
1.3. Examine outliers at five NYCCAS reference sites:
1. Examine dissimilarity index ratios to identify outliers
2. Impute a replacement value for any individual reference site measurement if:
a. concentration outside 95% CI vs. other reference sites for that session
b. concentration difference is unidirectional relative to other sites.
1.4. Explore and compare temporal-adjustment methods for covariate selection and modeling
a. Difference method: adj(C_it) = C_it − Mean_Ref_t
b. Proportional adjustment: adj(C_it) = C_it × (Ref_season / Ref_t)
c. Assess correlation between the temporally adjusted values produced by the two methods
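The two adjustment methods and their comparison (1.4) can be sketched as follows; this is an illustrative Python sketch with made-up session means, not the NYCCAS code:

```python
# Illustrative sketch of the step 1.4 temporal adjustments (not the NYCCAS code).
from statistics import mean

def difference_adjust(c_it, ref_t):
    """Difference method: adj C_it = C_it - session reference mean."""
    return [c - r for c, r in zip(c_it, ref_t)]

def proportional_adjust(c_it, ref_t, ref_season):
    """Proportional method: adj C_it = C_it * (seasonal ref mean / session ref mean)."""
    return [c * (ref_season / r) for c, r in zip(c_it, ref_t)]

def pearson_r(x, y):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

site = [12.0, 15.0, 9.0, 11.0, 14.0, 10.0]   # six two-week session means (made up)
ref  = [10.0, 13.0, 8.0, 9.5, 12.0, 9.0]     # reference-network session means (made up)
diff = difference_adjust(site, ref)
prop = proportional_adjust(site, ref, mean(ref))
print(round(pearson_r(diff, prop), 3))
```

A high correlation between the two adjusted series indicates that the choice of adjustment method has little influence on downstream covariate selection.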
STEP 2: Select observations for model-building
Second, we selected a random subset of sites to be used for model-building. Measurements and covariates for the remaining sites (15%) would be used to validate the model (i.e., to ensure that the model could produce reasonable estimates of measured concentrations at these sites).
2.1. Select measurement sites for LUR model-building:
a. Select a validation dataset of approx. 15% of all sites (5 purposeful, 20 systematic).
b. Ensure that modeling and validation subsets reflect concentration and source distributions.
c. Build models using the remaining 85% of sites.
d. Derive final model coefficients by re-running the final model on all 100% of sites.
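The site split in 2.1 can be sketched as below. This is an illustrative Python sketch, not the NYCCAS procedure; the site labels and the fixed random seed are our own:

```python
# Illustrative sketch of the step 2.1 site split (not the NYCCAS code):
# hold out 5 purposeful and 20 systematic sites (~15% of 150) for validation.
import random

def split_sites(purposeful, systematic, n_val_purp=5, n_val_sys=20, seed=42):
    """Return (model-building sites, validation sites)."""
    rng = random.Random(seed)
    val = rng.sample(purposeful, n_val_purp) + rng.sample(systematic, n_val_sys)
    build = [s for s in purposeful + systematic if s not in val]
    return build, val

# Hypothetical site identifiers (the actual NYCCAS site IDs differ).
purposeful = [f"P{i:02d}" for i in range(30)]
systematic = [f"S{i:03d}" for i in range(120)]
build, val = split_sites(purposeful, systematic)
print(len(build), len(val))  # -> 125 25
```

The build set drives covariate selection; the validation set is only touched in Step 5.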
STEP 3: Select “candidate” covariates for consideration in model-building process
We created dozens of GIS-based metrics using a number of spatial datasets available publicly or within NYC databases. Such datasets included New York Metropolitan Transportation Council (NYMTC) street maps and related measures of general traffic and truck traffic intensities, data on permitted emissions sources within the city from the NYC Department of Environmental Protection (DEP), and buildings-related metrics from the City of New York Department of City Planning Primary Land Use Tax Lot Output (PLUTO™) database (e.g., total built space, or built space of a specific type).
From each dataset, we derived site-specific indices of emission sources, most of which were computed within varying distances (or, within circular “buffers”) of each monitoring site. This resulted in hundreds of covariates used in subsequent analyses.
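A buffer-based covariate of this kind can be sketched as follows; this is an illustrative Python sketch with hypothetical coordinates, not the NYCCAS GIS workflow:

```python
# Illustrative sketch (not the NYCCAS GIS workflow): a simple buffer covariate,
# here a count of emission sources within a circular buffer of a monitoring site.
import math

def sources_within_buffer(site_xy, sources, radius_m):
    """Count source points within radius_m meters of the site."""
    sx, sy = site_xy
    return sum(1 for (x, y) in sources
               if math.hypot(x - sx, y - sy) <= radius_m)

# Hypothetical source coordinates (meters) relative to one site at the origin.
sources = [(10, 10), (300, 0), (0, 450), (600, 600)]
site = (0, 0)
print([sources_within_buffer(site, sources, r) for r in (50, 500, 1000)])  # -> [1, 3, 4]
```

Repeating this for every site, metric, and buffer radius is what produces the hundreds of candidate covariates described above.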
3.1: Create source categories
Because many of our 600+ covariates were highly similar indices of the same key sources, we opted to group covariates into policy-relevant, reasonably interpretable source categories, which were:
(1) Total traffic density
(2) Roadway lengths and distance to major roadway
(3) Truck/diesel-related traffic terms
(4) Population
(5) Built space
(6) Land use & land cover
(7) Permitted emissions sources
(8) Transportation/distributed facilities
3.2. Produce and examine multiple-buffer correlation structures for each source category.
In order to select an ‘optimal’ buffer for each source metric, and to choose among the many highly-correlated covariates within each source category, we developed a correlation matrix with each metric within a given category (e.g., all traffic measures) listed down the y-axis, and the range of considered buffers (50 to 1000 m) along the x-axis. The value of each matrix cell is the Pearson correlation (r) between that buffer-specific covariate and the temporally-adjusted concentration.
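The construction of such a matrix can be sketched as follows; this is an illustrative Python sketch with made-up covariate values and metric names, not the NYCCAS code:

```python
# Illustrative sketch of the step 3.2 correlation matrix (not the NYCCAS code):
# Pearson r between each buffer-specific covariate and the temporally-adjusted
# concentration, with metrics on rows and buffer radii on columns.
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def correlation_matrix(covariates, concentration):
    """covariates: {metric: {buffer_m: [site values]}} -> {metric: {buffer_m: r}}"""
    return {metric: {buf: round(pearson_r(vals, concentration), 2)
                     for buf, vals in buffers.items()}
            for metric, buffers in covariates.items()}

# Made-up data: 5 sites, 2 metrics, 2 buffers.
conc = [3.1, 4.5, 2.8, 5.0, 3.9]
covs = {"traffic_density": {50: [1.0, 2.2, 0.9, 2.8, 1.9],
                            1000: [10.0, 14.0, 9.5, 15.0, 12.0]},
        "truck_routes":    {50: [0.2, 0.5, 0.1, 0.6, 0.3],
                            1000: [2.0, 3.1, 1.8, 3.4, 2.5]}}
for metric, row in correlation_matrix(covs, conc).items():
    print(metric, row)
```

Scanning each row of the resulting table identifies the buffer at which a metric correlates most strongly with the adjusted concentrations.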
Below is an example of this covariate matrix, wherein we compare multiple traffic metrics at buffers from 50 to 1000 m vs. temporally-adjusted PM2.5 concentrations:
3.3. Using matrices like the example above, for each source category, we selected the single best buffer-specific candidate. (In this example, we chose nymtr_1000.)
3.4. To select a second covariate from each category, which would maximally add to explaining variability in the pollutant of interest after adjusting for the first selected variable, we implemented a series of iterative regressions. This iterative procedure tested all remaining covariates within the source category to optimize R2 in predicting the pollutant concentration.
3.5. Returning to the correlation matrix (as above), we chose buffer-specific covariate(s) with similar (or, the second-highest) association with temporally-adjusted PM2.5. (We chose signl_1000).
We then repeated the iterative regression procedure described above to select the covariate which best explains additional variability in concentrations after accounting for (in this case) signl_1000.
Using this process, we selected up to four “candidate” modeling covariates per source category.
3.6. We conducted sensitivity testing on our list of selected candidate covariates by comparing it against covariates selected using automated selection strategies (e.g., Random Forest, tree structures).
STEP 4: LUR forward stepwise model-building process
We opted to build LUR models using a forward-stepwise manual procedure. Under this primary modeling method, we predicted total concentrations using both temporal and spatial terms (e.g., the mean reference site value for each session, and GIS-based source covariates). The R2 from this model approximates the total amount of variability in each concentration explained by our temporal and spatial terms.
A secondary method was also run in tandem with the first, throughout the modeling process. Under this secondary method, we predicted temporally-adjusted concentrations using only GIS-based (spatially-varying) source terms. The R2 from this model approximates the total amount of spatial variability in each concentration explained by the spatial terms alone.
The steps below describe in detail the process used to build the total-concentration model (the primary modeling method), implemented in SAS using Proc Reg:
4.1. Incorporate (as a predictor) temporal adjustment in pollutant-specific LUR models:
Retain both temporal & spatial terms in the LUR model, as below:
(C_it = Mean_Ref_t + Source_i + … + b_it) for site i, time t
4.2. Determine order of category (term) entry into model:
- Opted to first test traffic (for PM2.5, NO2, EC), and built space (for SO2).
- Ordering based on: (1) sampling design, (2) hypothesized sources, (3) strength of correlations with temp-adjusted concentrations, (4) geographic coverage of covariate layers, (5) automated methods output (trees, RF), (6) interpretability.
- Other categories were tested in order of their descending correlations with the temporally-adjusted concentration.
4.3. Enter terms into model, using forward stepwise process:
a. Test the ‘best’ source term from the first category. Retain the term if p < .05, R2 improves, and VIF < 2.0.
b. Test the optimal “second” covariate. Retain if p < .05, R2 improves, and VIF < 2.0.
c. We repeated the process through all candidates, and all source categories.
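The retention rule in 4.3 can be sketched as below; this is an illustrative Python sketch with hypothetical covariate names and precomputed diagnostics (in the actual stepwise procedure, the baseline R2 is updated after each retained term):

```python
# Illustrative sketch of the step 4.3 retention rule (not the NYCCAS SAS code).
def retain(p_value, r2_with, r2_current, vif):
    """Keep a candidate only if it is significant, improves R2, and VIF < 2."""
    return p_value < 0.05 and r2_with > r2_current and vif < 2.0

# Hypothetical candidates with precomputed diagnostics for one pass.
r2_current = 0.58
candidates = [
    {"name": "traffic_km_1000", "p": 0.001, "r2_with": 0.62, "vif": 1.3},
    {"name": "truck_pct_500",   "p": 0.20,  "r2_with": 0.63, "vif": 1.5},  # fails p
    {"name": "built_space_100", "p": 0.01,  "r2_with": 0.66, "vif": 2.4},  # fails VIF
]
kept = [c["name"] for c in candidates
        if retain(c["p"], c["r2_with"], r2_current, c["vif"])]
print(kept)  # -> ['traffic_km_1000']
```

The VIF threshold of 2.0 is deliberately strict, keeping the retained spatial terms nearly independent of one another.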
4.4: Next, we explored the possible effect of site characteristics, which may be vague source categories, or may modify relationships between sources and observed concentrations. We considered: (1) Street canyon, (2) Open space, (3) Tree cover, (4) Elevation, (5) Imperviousness.
a. We screened each covariate by testing its univariate relationship with the model residual achieved to this point.
b. Test the covariate in the multivariate model:
(C_it = Ref_t + Source_i + Source_i × Modifier_it + … + b_it) for site i, time t
4.5. We calculated and mapped model residuals, in order to:
a. Examine spatial patterns in residuals, identify missing spatial covariates
- Build & test appropriate GIS covariates (e.g., inverse distance to bus route).
b. Re-compute the correlation matrices (from 3.2), testing GIS terms in source categories against model residuals to identify additionally useful terms.
c. Incorporate additional covariates into model.
- We opted to retain additional terms if p < .05, R2 improved, and VIF < 2.0.
STEP 5: Sensitivity testing and model validation
After model-building, we performed several steps to test and validate the LUR model. Our goal was to establish that our resultant model would be insensitive to various permutations of our modeling methods.
5.1: First, we assessed sensitivity to covariate selection, by:
a. Testing all model terms for robustness to outliers/influential points.
b. Individually replacing each covariate with other candidates from the same category.
c. Replacing significant modifiers to assess the uniqueness and interpretability of modification.
d. Assessing covariate dependency - removing each term to ensure the others retain significance.
e. Testing model covariates for sensitivity to adjustment by borough.
5.2: We assessed sensitivity to site selection and inclusion of observations by:
a. Testing parameter sensitivity using bootstrap (both leave-one-out and leave-three-out).
b. Predicting concentrations at validation locations (n = 25).
- We evaluated RMSE, MAPE, other diagnostics of model fit.
- If influential point affects coefficients, we tested both models vs. validation sites, to inform decision on retaining point.
- We estimated final model coefficients using all n = 150 observations.
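The validation diagnostics in 5.2 can be sketched as follows; this is an illustrative Python sketch with made-up observed and predicted values, not the NYCCAS code:

```python
# Illustrative sketch of the step 5.2 fit diagnostics (not the NYCCAS code):
# RMSE and MAPE between measured and model-predicted concentrations at the
# held-out validation sites.
def rmse(obs, pred):
    """Root mean squared error."""
    return (sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)) ** 0.5

def mape(obs, pred):
    """Mean absolute percentage error (percent)."""
    return 100 * sum(abs(o - p) / o for o, p in zip(obs, pred)) / len(obs)

# Made-up validation-site concentrations.
observed  = [10.0, 12.5, 8.0, 15.0, 11.0]
predicted = [9.5, 13.0, 8.4, 14.2, 11.6]
print(round(rmse(observed, predicted), 3), round(mape(observed, predicted), 2))
```

Low RMSE and MAPE at sites never used in model-building indicate that the model generalizes beyond the 85% build set.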
5.3: We assessed the sensitivity of the model-building process by:
a. Assessing sensitivity to the temporal adjustment method:
- We again ran model covariates against proportionally adjusted concentrations:
=> adj(C_it) = (C_it / Ref_t) × Ref_season, modeled as adj(C_it) = Source_i + … + b_it, for site i, time t
providing the approximate percent of spatial variance in concentrations explained by the model.
b. Performing a backwards stepwise elimination process from the original candidates list:
- Sequentially eliminate the term with the highest p-value, refitting until all remaining terms have p < .05.
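The backwards elimination in 5.3b can be sketched as below. This is an illustrative Python sketch; the refit function and term names are hypothetical stand-ins for refitting the model in SAS:

```python
# Illustrative sketch of step 5.3b backwards elimination (not the NYCCAS code).
def backward_eliminate(terms, refit, alpha=0.05):
    """Drop the highest-p term and refit until every remaining p-value < alpha."""
    terms = list(terms)
    while terms:
        pvals = refit(terms)                 # {term: p-value} for the current fit
        worst = max(terms, key=lambda t: pvals[t])
        if pvals[worst] < alpha:
            break
        terms.remove(worst)
    return terms

# Toy refit: fixed hypothetical p-values, pretending they are unchanged
# across refits (real p-values shift each time a term is dropped).
fixed_p = {"traffic": 0.001, "built_space": 0.03, "port_dist": 0.40, "trees": 0.08}
print(backward_eliminate(fixed_p, lambda ts: {t: fixed_p[t] for t in ts}))
# -> ['traffic', 'built_space']
```

Agreement between the backwards-eliminated model and the forward-stepwise model supports the stability of the selected covariates.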
STEP 6: Examine model residuals and map predictions
We mapped and examined final LUR model residuals to identify non-random spatial variation in the model residuals, which could either: (a) be explained by additional source terms not yet incorporated into the model, or (b) be captured in a residual layer incorporated into the LUR-generated final surface predictions.
6.1. We mapped model residuals and examined variograms.
6.2. We mapped LUR-based model predictions across all NYC lattice cells (300 m × 300 m), as well as significant covariates for each pollutant.
6.3. We explored two different methods for generating residual surfaces and spatial predictions of pollutant concentrations:
(1) Regression kriging, wherein smoothing is restricted to the spatial variability unexplained by the LUR model;
(2) Kriging with external drift (KED), wherein the residual surface is fit simultaneously with the LUR model.
KED models were used to produce the smoothed estimates as described below.
6.4. We produced smoothed estimates for each pollutant across all 7,500 NYC lattice cells.
Due to the small buffers found significant in several of our LUR models (e.g., kernel-weighted traffic within 100m) and resultant spottiness in prediction surfaces, we opted to generate final predictions at 100 meter lattice cells, and to smooth these estimates across space.
The output of the Kriging with External Drift (KED) model was joined to a shapefile with a point in space for each of the 100 meter lattice centroids. This point file was used as the input for a smoothing process called Inverse Distance Weighting (IDW). This process obtained the surfaces presented in the report’s pollutant maps.
IDW creates a surface (an image of values symbolized by color) out of square cells. The value of each cell is determined from a set of sample points, weighted by a function of inverse distance from each point to the cell. The inverse function gives progressively less influence to points further away from the cell. The distances can be raised to a higher power, in order to give more influence to the nearest points, or to a lower power, in order to give more influence to points that are farther away.
IDW was implemented using the Spatial Analyst extension in the ArcGIS Desktop 9.2 software.
-A surface cell size of 25 feet was chosen to provide a fine resolution for the maps in the report.
-A power of 1 was chosen, so that the inverse distance function alone determined the influence of nearby points.
-The nearest 100 points (lattice centroids) were allowed to have influence on each cell value (there were 82,748 points used for New York City).
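The IDW calculation described above can be sketched as follows; this is an illustrative Python sketch, not the ArcGIS Spatial Analyst implementation, and the coordinates and values are made up:

```python
# Illustrative sketch of inverse distance weighting (not the ArcGIS tool):
# each cell value is a weighted average of the k nearest sample points, with
# weights = 1 / distance**power (power = 1, as chosen for the report's maps).
import math

def idw(cell_xy, points, power=1, k=100):
    """points: list of ((x, y), value). Average the k nearest by 1/d**power."""
    cx, cy = cell_xy
    nearest = sorted((math.hypot(x - cx, y - cy), v)
                     for (x, y), v in points)[:k]
    if nearest and nearest[0][0] == 0:   # cell sits exactly on a sample point
        return nearest[0][1]
    wsum = sum(1 / d ** power for d, _ in nearest)
    return sum(v / d ** power for d, v in nearest) / wsum

# Three hypothetical lattice-centroid values.
pts = [((0, 0), 10.0), ((4, 0), 20.0), ((0, 4), 30.0)]
print(idw((1, 0), pts))
```

Because the nearest point dominates at power 1, cells close to a centroid take values close to that centroid's estimate, while distant cells blend several neighbors.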
After the value surface was created for each pollutant, it was symbolized by stretching the cell values along a color ramp from yellow to a very dark red.
6.5. Finally, we produced summary measures of the distributions of estimated pollutant concentrations within each populated NYC Community District. These were generated by joining a shapefile of Community District boundaries to a surface of smoothed estimates generated as in step 6.4, but with a 100 foot surface cell size. This assigned each lattice cell estimate to a Community District. The resulting data table was summarized at the Community District Level to compute the mean, minimum, maximum, 10th and 90th percentiles.