4-Basic Data Validation

  1. There are several areas of data validation:
  2. site operator knowledge and logbook records,
  3. instrument flags,
  4. data checks,
  5. comparison of parameters from one year/season to the same or previous year/season, and
  6. comparison with other parameters at the same site and of the same parameter at nearby sites.

Site operator knowledge and logbook records can be used to flag data based on dates, which can be relatively easily done either in excel or access. This module will present a method to flag and exclude data based on instrument flags and data checks. Data checks include looking at:

i.the range of values (negative, less than the MDL, or very high values),

ii.rapid changes in values from one hour to the next, and

iii.“stuck” (repeated identical) values.

Open the excel file 4-Ozone2010.xls. This data is all the Septemberozone data from a tribal site in 2010. We are going to assume that all the QC checks passed (we will look at the QC data from this site in the next module), and that according to the SOP, the analyzer was taken offline during the QC checks so that none of these represent QC data. However, it is important to verify this, and in looking at the QC records we see that on 9/17 a span check was conducted, with this record:

12:06 / 9/17/10 / API / 401 / 404

In looking at the data, we see that at noon on that day there is a record showing 406.9 ppb, so we know right away that the analyzer was not taken offline, and that that record should be invalidated.

  1. Before doing anything to the raw data, make a copy of the “original data” sheet by clicking in the left upper blank gray square and copying the entire sheet into a new sheet, and name this new sheet “review”.
  1. Now, invalidate that row by entering a QC in the Flag field of that record (9/17 at noon). Make notes of the flags you apply, and be consistent from one month to the next. A list of the flags used in this exercise is at the end of this module description.
  2. Review the “review” sheet using autofilter. Verify that the columns contain the type of data expected, and note the range of data, esp. in the Concentration field.
  3. Make notes of what looks questionable. What are the highest and lowest ozone values? The Flag field has already flagged data that is negative, very high, or questionable in some way (some of it looks fine to me, but it was flagged for something so we will try to figure out why). These R values in the Flag column are instrument flags, or were generated by the site operator at the time. Before continuing, remove the autofilter.
  1. First, data that is less than the MDL for this instrument according to EPA can be invalidated, based on the MDL list (see spreadsheet “AQS MDLs and Method Codes.xls” with the MDLs and other AQS-related values and codes). (Remember that all negative values should not be necessarily invalidated—check the MDL list first. For PM10 BAMs and TEOMs, data are valid down to -50 ug/m3 although you can report values as low as the negative of the MDL, with a QA-type of flag [not null value code].) For this instrument (and all ozone FEM instruments in this list) the value is 5 ppb.

Create a column to the left of the Flag field (click any cell in the Flag column, select Insert, Sheet columns), and insert =IF(E2<5,"MDL","") in G2, then autofill downby doubleclicking the black cross in the lower right corner of that cell. Name this column MDL.

  1. Rapid changes from one hour to the next are also signs of instrument instability. Create two new columns to the left of your new MDL column by clicking any cell in the MDL column, then Insert, Sheet Columns (twice). Name the one now in column G “hour change” and the one to the left of the MDL column “HD” (for high difference).

Insert this formula in row 3in the“hour change” column (G if you are using the same columns as in the screenshots), which will give you a “1” if there is only one hour between the rows, and a “2” if there is a gap of 1 hour, between rows, etc: in row 3 =C3-C2 and then copy this formula down. (In case you find row-to-row changes that are greater than 30 ppb, for example, you want to make sure that those are really hourly changes and not due to a gap of hours or days between rows.)

Insert this formula in row 3 under the HD header: =E3-E2 and copy it down using the black handle and autofill. (Not row 2, because this value needs to be in the current row, reflecting change from the previous hour, which doesn’t exist in row 2.) This fills the HD column with the row-to-row difference. This allows you to check for rapid changes in concentration. In this case, the site operator thinks that a change of more than 30 ppb means that there is something wrong with the instrument. (You may think 20 ppb or a different hourly change might be important-that is a site operator judgment call.) We will filter out rows where the change was more than 30 ppb in 1 hour after we have inserted one more column.

  1. To check for “stuck” values, insert a column to the left of the “hour change” column and label it “repeated.” Enter a value of 1 in the cell under the header (G2).

In the thirdrow, under the cell with the 1 in it, type or copy this: =IF(E3=E2,G2+1,1) [this assumes your concentration column is column E.] This formula calculates how many times the exact same value repeats, which is generally a sign of bad data. Double click the black autofill cross to autofill the formula down.

Before we begin to review the data for:

  • the range of values (negative, less than the MDL, or very high values),
  • rapid changes in values from one hour to the next, and
  • “stuck” (repeated identical) values
  1. Invalidate values that you know are bad data, because it is useless to continue and look at ranges if there are values of -6999 in the data. (Note that this is for YOUR analysis. You should report all validated values to AQS, and EPA will do whatever they do with it—replace it with the MDL, half the MDL…) Use autofilter, click the down arrow next to MDL, and uncheck the box next to MDL, so that the column shows only the blank cells. You can see that the down arrow next to MDL now has a funnel in it, showing that data are filtered out in that column. Also, the row labels are now blue, with a slightly dark line in between missing rows. There are many rows (226) with values less than the MDL. Click cell A1, hold down the ctrl-shift-right arrow to select the first row, then still holding down ctrl-shift click the down arrow to select the entire range of data with the range selected click cntrl-C to copy. (Now you can see the las vegas lights between filtered-out rows, and other rows that are filtered out): then open a new sheet, and click in the upper left of that new blank sheet, then click ctrl-V to paste only the visible values. In the new sheet, click on any cell in the date field then click ctrl-down arrow to zoom to the bottom row, and you can see that there are 492 filled rows, as opposed to the 718rows in the unfiltered data in the “original-data” sheet. Name this new sheet “>MDL.”
  1. In this new “>MDL” sheet,note that the formulas from the 3 new calculated rows “HD”, “repeated,” and “hour change” did not copy over, which is good because we want only the values from the columns, and not recalculated values based on now-filtered-out rows. So, we can quickly review those records where an R flag was applied by the site operator. Use autofilter to show only those rows with the R filter (you can tell by looking at the bottom of the sheet where it tells you after you have applied the filter how many are now showing (31/491) that have the R flag.

Now, you can either look row-by-row by “hand,” or set a filter on the HD row and the hour change row so that you can more quickly invalidate rows where (a) there is an R flag, (b) there is a HD, and (c) the hour change is only 1. With autofilter showing only the R rows, apply a custom autofilter to the HD column as shown:

And the custom filter:

NOTE: Essentially, we’re looking for values that fall within this range: -30 ≤ x ≤ 30

then apply autofilter to the “hour change” column so only rows where the HD is based on a 1 hour change (check the boxes next to 1 and -23, but uncheck the 4 box).

(If you were reviewing this as your data, you would consult your logbook to see what was happening during this month since there are so many high HD rows.) At this point, you are looking for rows that you think should NOT be invalidated based on a change of more than 30 ppb between hours. Check for whether the concentration steadily but rapidly increasing as it might in the middle of the afternoon, so a change of 30 ppb “fits in” with the data around it.

You can see that ALREADY only rows with HD are showing, which indicates that the site operator knew something was awry with these R-flagged rows. If you scan through the values in the HD column using the list in autofilter, you can see that there are no HD values that look like there is even a chance that the data are valid:

This leaves 30 out of 491 rows visible, and unless the site operator knows something about these rows that indicate these values are valid, I would invalidate these rows. A change of more than 30 ppb (especially if it goes up then down then up, etc.) is not how an ozone analyzer should be operating in ambient air (at least in this location). I invalidated these rows by typing in RHD into the “Flag” column (meaning that both the R flag and the HD flag combined to invalidate the data) in the first row, and then copying down 30 times by clicking in the cell under the first RHD, then entering ctrl-“ to copy the above cell into the selected cell. You can also copy and paste RHD into the entire column at once. (type in RHD in the first cell, then select that cell to copy it, then select the range using crl-shift-down arrow, then cntrl-V to paste RHD into all the cells.) This pastes RHD into only visible cells, which you can check after you paste by showing all values, and you can see that there are gaps in the RHD rows which is what you want:

  1. Before continuing, is it useful to invalidate more data at this point? After talking with the site operator, who is familiar with the past several years of ozone concentrations at this site, we decide to investigate and probably invalidate all values greater than 300 ppb, just to make the next steps easier. Clear your existing filters, then use autofilter to select all rows that are not already flagged where the concentration is greater than 300 ppb (first apply autofilter to select only blanks in the Flag column, then autofilter to select rows where the concentration was greater than 300 ppb.) You can see that there are rows that are not already flagged, but for all of these 15 rows the HD change was greater than about 200 ppb since the previous hour, so invalidate these rows by entering >300 in the Flag field. You can enter it in the top visible row, copy that cell, select the rest of the range, then copy it into the selection, which only copies this into the visible cells:

Remove the autofilters so all data is shown. Now, just autofilter the Flag field and show only blanks and R flags (there is only one R-flagged row left).

Copy this into a new sheet named “<300>MDL&not_RHD”

At this point, you should have carried 448 records over to “<300>MDL&not_RHD” from the original sheet (“>MDL”) which contained 492 records.

At this point we are fairly confident that at the least the RANGE of ozone values are valid, so it makes sense to apply conditional formatting. (Before, all the -6999 and other weird low values as well as the weird and invalid high values would still be in the data making it hard to review.) Conditional formatting is helpful, because, say, if we find a suspicious value and just use a filter to find just that one row, we cannot see the data adjacent to that value. In conditional formatting high values appear red, making them easy to spot while looking at the adjacent data.

In the new sheet, with autofilter turned off, select all of the Concentration column by clicking on the E header, then click Home, Conditional Formatting, Color Scales, More Rules: Format all cells based on their values, 3-color scale, and change the low to green and the high to red:

Which gives you colors in the concentration column.

  1. Use conditional formatting to format the HD column and show yellowfor values where the ppb changed by more than 30. So, select the entire column I, Conditional Formatting, Highlight Cells, More Rules: and and format only cells that contain, Not Between, cell value less than -30 and greater than +30:

and set the format to be yellow by clicking Format, then click the Fill tab:

  1. For ease in reviewing the data, click in cell C2 and then View, Freeze Panes, to keep the first two columns (date and time) and the header row always visible.

Note that the formulas do NOT copy (you can copy them if you Paste Special) which is good, because we want the calculated VALUES of HD and hour difference to be in the new sheet (we filtered out rows, so we do not want those values recalculated.)

These colors will help you more quickly scan the rows. A red value in concentration AND a yellow value in the HD column is a bad sign for that row.

  1. Review the data for “stuck” values by autofiltering in the “repeated” column to see how many values repeat.

In this case, there is a row where the same value has been repeated six times. Uncheckthe 1 in the dropdown filter (the 1 shows all rows where the value just occurred once, so it shows all rows), then make a note of these date/times (or row numbers) and the values that appear with the check box on 2 through 6. Remove the filter to look at those rows with their adjacent data. The value of 23 that is repeated six times does look like a “stuck” value and I would invalidate these rows by entering an S in the Flag column.

Remove the filter on the “repeated” column.

  1. Filter out these “S” rows, by showing only “Blanks” in the “Flag” column.
  1. Review the data for HD values, which are those that now show yellow where the row changed by more than 30 ppb. If you have less than a quarter of data, it can be relatively easy to scroll through the data and look at the conditional formatting if you slide the magnification slider in the bottom right of the screen far to the left, so that you can see several weeks of data at once, but are only looking for the yellow cells in the HD column. This allows you to see that there really doesn’t appear to be a correlation between high concentrations (redder cells) with yellow HD cells. (In other words, the analyzer varies greatly hour-to-hour at all concentrations.)

You want to see a gradual change in the color of the cells in the concentration column, and generally you will invalidate data where there is both a high (red) value shown in the concentration column and a high (yellow) HD, meaning the concentration jumped more than 30 ppb.

  1. At this point, you are looking for HD rows that you do NOT want to invalidate. If the data show a consistent up or down trend, even if the concentration changed by more than 30 ppb, you might want to keep the data as valid. For example, rows 32 and 33 show decreasing concentration, but the trend is both consistent and also sensible in terms of the time of day. To keep these as valid, enter a V in the Flag field. I would also validate row 113because it is consistent with the hours around it, by changing that flag to V. These are the only 3 rows of the highlighted HD rows that I would defend as valid based on the information we have. Autofilter confirms these 3 rows if you set Flag to show only V values:
  2. Now we can invalidate the remaining HD rows. First go through the different added review columns to make sure that you have not left out any repeated rows or weird flags. Filter the Flag column to show only blanks: