Steps of evaluation:

  1. Clean up your data—see Data Cleanup folder. Eliminate inconsistent field names, gaps, error codes.
  2. Verify QC limits met—see QC folder
  3. Review logs
  4. Evaluate completeness
  5. Enter QC data in DASC spreadsheet
  6. Flag suspicious data that can be easily eliminated later using queries based on flags.
  7. Aggregate and organize data into sets; see folder “Aggregate Data.” Import data into MS Access using wizard (use wizard and see Tribal Data Access Quick Start Guide) Reduce the number of data points from hourly to daily or 5-day averages.
  8. Find Patterns: With the data sets aggregated using Access, use Excel to plot data for common sense evaluation(whether data meets expectations). Evaluate correlation with other parameters such as solar radiation, air temp, and NO2. For multiple sites, look at differences in the correlation between sites.
  9. Temporal
  10. Plot vs met or other parameters to evaluate whether data makes sense
  11. Use scatterplots, add trendline, show R2 and equation on graph, or use Excel’s slope and RSQ functions to quantify the correlation between the parameters.
  1. Ask your question: in this case, is there a statistically significant difference between the data gathered between ## and ## and the data before and after? BE SPECIFIC. Decide on your assumption that the data must be used to prove wrong. This is the null hypothesis. If you think the data from 2 sites are different, then assume that the difference between them is zero and then the data must prove that wrong, at some level of confidence. In this case the null hypo is that there is zero difference between the means of the datasets.
  2. Evaluate the shapes of the distributions. In most cases environmental data will not be normally distributed, but only “approximately normal.” Normally distributed data can be analyzed using familiar tests such as t-tests that are available in Excel Data Analysis Toolpack. If data is clearly non-normal (all over the place with many low and many high values) then you should use non-parametric tests, which are not as difficult to use as you might think. Tools for plotting data, assessing normality, and non-parametric tests are available free from DOE by downloading and installing the Visual Sample Plan ( and ignoring everything except the Tools, Analyze Data function.
  3. If you decide the data are normal enough to use the familiar t-tests (called parametric tests), then decide on which test to use. If you are comparing two datasets, use Excel’s F-test to evaluate the shape of the distributions of each data set. If the variance (spread) of each distribution is similar, then use Excel’s t-test-with-similar-variances. If the variance is not similar, use Excel’s t-test-with-different-variances test. In this example use Excel’s t-test to answer question on difference between datasets.

Breakdown of stats analysis:

First question to be answered: Is there a significant difference between the validated data and the invalidated data? Then, if there is, what is the difference and are there correlating factors such as wind speed, direction, or ambient temp?

A. Evaluated distributionsto answer the question of whether they are lognormal. Plotted Q-Q plots and they are lognormal looking enough to use parametric statistics: (one example here)

B. Decide on how to aggregate data:First analysis uses the value of 5-day avgs of (Ute 3 - Ute 1) / Ute 3 daily max values. This was used because if the difference between the two sites’ data is different for the invalidated data, you could conclude that that ratio provides information causing one to be suspicious about that subset of data.

The two datasets: VALID includes 5-day averages for dates between 1/1/05 and VALID including dates from 1/5/01 thru 1/1/05 and 8/20/07 and 4/28/08 (N=265 ) and INVALID for dates between 1/1/05 and 8/20/07 (N= 172):

C. Assessed variances of datasets: are they equal? No:

INVALID / VALID
COV / 0.6626 / -21.5942
stdev / 0.1299 / 0.1703
avg / 0.1960 / -0.0079
FTEST:
1.5E-04 / this (probability, not test statistic) is BEYOND 0.05, so do reject H0
H0: variances are equal
H1: variances are not equal
cannot use equal variances t-test

D. Applied the t-test assuming unequal variances, to answer the question: IS THERE A DIFFERENCE BETWEEN THE INVALID AND THE VALID DATA? (what is the probability that they come from the same underlying distributions?)

t-Test: Two-Sample Assuming Unequal Variances:
INVALID / VALID
Mean / 0.1960 / -0.0079
Variance / 0.0169 / 0.0290
Observations / 172 / 265
Hypothesized Mean Difference / 0
df / 424
t Stat / 14.1543
P(T<=t) one-tail / 8.01E-38
t Critical one-tail / 1.6485
P(T<=t) two-tail / 0.0000
t Critical two-tail / 1.9656

ANSWER: no, there is hardly any chance at all that they come from the same distributions based on this test.

E. Is there a difference in the correlation between two sites for invalid vs. valid data?Is there a difference in the correlation between the two sites during the valid datesvs during the invalid dates? (i.e., when one is high is the other high also? And does this track differently for the valid vs the invalid data?) To answer this question, just did a scatter x-y plot with equation:

(OR calculated R squared for datasets with too many values to construct a graph)

This is inconclusive:

R squared for Ute 3 vs. Ute 1 max daily all = / 0.483
2377 / is N of Ute 1 Max Daily
2598 / is N of Ute 3 Max Daily
R squared for Ute 3 vs. Ute 1 max daily INVALID = / 0.539
929 / is N of Ute 1 Max Daily Invalid
938 / is N of Ute 3 Max Daily Invalid
R squared for Ute 3 vs. Ute 1 max daily VALID = / 0.669
1448 / is N of Ute 1 Max Daily VALID
1660 / is N of Ute 3 Max Daily VALID
R squared for Ute 3 vs. Ute 1 FiveDayAvgsOfDailyMean_All = / 0.565
437 / is N of Ute 1 Five-day avgs of Daily Mean
437 / is N of Ute 3 Five-day avgs of Daily Mean
R squared for Ute 3 vs. Ute 1 FiveDayAvgsOfDailyMean_INVALID= / 0.674
171 / is N of Ute 1 Five-day avgs of Daily Mean INVALID
171 / is N of Ute 3 Five-day avgs of Daily Mean INVALID

R squared for Ute 3 vs. Ute 1 FiveDayAvgsOfDailyMean_VALID= / 0.686
266 / is N of Ute 1 Five-day avgs of Daily Mean VALID
266 / is N of Ute 3 Five-day avgs of Daily Mean VALID