4 - Graphical Displays and Summary Statistics for Numeric Data (i.e. Descriptive Methods for Numeric Data)

4.1 ~ EXAMPLE: Monitoring DDT Levels Found in Fish in the Tennessee River Near Triana, Alabama


Background:
Olin Agrees to Clean Up DDT in Triana, Alabama Area
[EPA press release - April 21, 1983]

The U.S. Environmental Protection Agency announced today that the Olin Corporation has formally agreed to a multi-million dollar cleanup of DDT contamination around its former manufacturing facility, the Redstone Arsenal in Alabama, and to provide for health care for the residents of the nearby community of Triana.

"The agreement marks the first time an EPA enforcement action has provided for health care for an affected population," said EPA Acting Administrator Lee Verstandig.

Verstandig added, "This unique health care provision provides $5 million to establish the Triana Area Medical Fund, Inc., which will provide primary health care and monitoring of the residents. This is a non-profit corporation whose Board of Trustees will consist of representatives of both local citizens groups and the federal government."

Many of Triana's 1000 residents have been found to have elevated levels of DDT in their blood, primarily from the consumption of DDT-contaminated fish caught in Indian Creek, which runs near the plant site. The small community is about 12 miles from the Arsenal.

Felix Wynn, an 85-year-old resident of Triana, has 3,300 parts per billion of DDT in his blood. That is more DDT than has ever been found in any human being.

EPA announced general terms of the agreement on December 30, 1982. The final agreement was lodged with the U.S. District Court in Birmingham, Alabama, April 15, 1983. A notice announcing a 30-day comment period was published in the Federal Register the same day.

DDT was manufactured at the Redstone Arsenal site from 1947 to 1971 by Olin and the predecessor lessee of the site, the Calabama Chemical Co. In 1972, former EPA Administrator William D. Ruckelshaus, currently the EPA Administrator-Designate, banned the use of DDT in this country, except for limited situations.

In the late 1970s, widespread DDT contamination was discovered at the plant site, in nearby waterways (the Huntsville Spring Branch-Indian Creek, a tributary of the Tennessee River), and on more than 1400 acres of the Wheeler National Wildlife Refuge, the largest and oldest national refuge in Alabama. Elevated levels of DDT have also been detected in wildlife in the area.

In 1980, the Justice Department, at EPA's request, sued Olin, asking them to clean up the contamination. In October 1981, the site was designated one of EPA's top-priority hazardous waste sites for cleanup under the new Superfund program. Superfund is the $1.6 billion fund authorized under the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (CERCLA) which gives EPA the resources to clean up abandoned hazardous waste sites.

Under the settlement, Olin will clean up the DDT residue from the nearby Wheeler refuge and from the sediment of the Huntsville Spring Branch-Indian Creek tributary to reduce DDT levels in fish within a 10-year period.

In addition to paying for the cleanup, Olin will provide $24 million to assist residents of the contaminated area. This includes $19 million to satisfy personal injury claims of over 1000 private parties including local residents and local commercial fishermen.

Under the agreement, Olin must submit a comprehensive remedial cleanup plan for the area to a review panel of federal, state and local representatives by June 1, 1984. The review panel will approve or recommend changes to the plan and will provide complete oversight of the cleanup.

As part of the environmental cleanup, Olin also will provide short and long-term environmental monitoring for all affected areas. The EPA-Olin agreement also settles related suits against Olin by the State of Alabama and three citizens groups.

4.2 ~ Graphical Displays for Numeric Data: The Histogram
Histograms and Outlier Boxplots
To obtain a histogram and outlier boxplot for numeric variable(s), select Distribution from the Analyze pull-down menu and place the variable(s) that you wish to examine in the right-hand box. We will begin by examining a histogram for the weight of the fish sampled as part of the study.

Three key features of a histogram

The Horizontal Layout, Prob Axis, Normal Curve & Smooth Curve options have been used in constructing the histogram for weight of the fish sampled. The locations of these options are illustrated in the graphics below:

The density or distribution curves are added by selecting the options shown below.

Histograms for length, weight and DDT level found in the tissue of the fish sampled are shown below.

We can see that the lengths of the fish sampled appear to have a left-skewed distribution with several outliers (i.e. unusual observations) on the low end. These outliers are all largemouth bass, which evidently are generally shorter than the other fish species sampled. The typical length appears to be somewhere between 42 and 45 cm. The weight distribution appears to be slightly skewed to the right, but is not far from normal, as evidenced by the fairly close agreement between the normal curve and the smooth curve distribution estimate. There are also a couple of outliers flagged in the boxplot. We will discuss the boxplot in more detail later. A typical weight for the fish sampled is approximately 1000 grams. The DDT concentrations of the fish sampled follow a severely right-skewed distribution with many obvious outliers on the high end, with one fish having a DDT concentration of approximately 1100 ppm.

Using the Location and Species columns to label the points in succession shows these observations correspond to catfish and smallmouth buffalo sampled from locations 1, 8, and 13. Examination of the map shows that locations 1 and 13 are in close proximity to the plant on Indian Creek that was the source of the DDT contamination of the ecosystem. To obtain the labeling feature in JMP, right-click on the variable name in the column list on the left-hand side of the spreadsheet and select the Label/Unlabel option.

Use Label/Unlabel to assign variables to be used as point labels when interacting with plots.


4.3 ~ Transformations to Improve Normality
When the distribution of a variable is markedly skewed (left or right), we can often use a transformation to obtain approximate normality. The common remedy is to raise the variable to some power; this type of transformation is known as a power transformation. To remove right skewness we consider powers less than 1, such as 1/2 (i.e. square root), 1/3 (i.e. cube root), 0 (which corresponds to a log transformation), -1/2 (i.e. reciprocal square root), -1 (i.e. reciprocal), etc. As a rule of thumb, we often avoid negative power transformations because they reverse the ordering of the data, i.e. the largest observed value will become the smallest and vice versa. Also, the units of a negative-power-transformed variable can be difficult to explain. To remove left skewness, which is less common, we typically raise the variable in question to a power greater than 1 (e.g. 1.5, 2 or 3).
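The effect of moving down the ladder of powers can be sketched outside JMP. The Python sketch below is only an illustration: the function names `power_transform` and `skewness` are made up for this example, the data are hypothetical (not the fish study values), and the moment-based skewness used here may differ slightly from the skewness JMP reports.

```python
import math


def power_transform(values, power):
    """Apply a ladder-of-powers transformation to strictly positive values.

    A power of 0 is treated as the base-10 logarithm, following the
    convention described in the text.
    """
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive data")
    if power == 0:
        return [math.log10(v) for v in values]
    return [v ** power for v in values]


def skewness(values):
    """Moment-based skewness; positive values indicate right skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed concentrations (NOT the study data).
ddt = [0.5, 1.2, 2.0, 3.1, 4.8, 7.5, 12.0, 30.0, 95.0, 400.0]

# Skewness after each transformation: raw, square root, log, reciprocal.
for p in (1, 0.5, 0, -1):
    print(p, round(skewness(power_transform(ddt, p)), 2))
```

For data like these, the skewness should shrink toward 0 as the power drops from 1 toward 0, which is the behavior the ladder of powers is meant to exploit.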

Ladder of Powers

From the histogram and boxplot above we see that the distribution of the DDT concentration is extremely skewed to the right. To improve normality we will consider a transformation to the log scale. (Note: This is a very common transformation to use when working with toxicity data. The logarithmic transformation is one of the most commonly employed transformations in statistics!)

To do this in JMP you must use the JMP Calculator, which allows you to perform a variety of data transformations and manipulations. To create a column containing a function of another column, double-click to the right of the last column to add a new column to the spreadsheet. Next, double-click at the top of the column to obtain the column information window. In the window, change the name of the new column to log10(DDT), select Formula from the New Property pull-down menu, and click Edit Formula as shown below.

The JMP Calculator should then appear on the screen. To take the base 10 logarithm of the DDT levels, select Transcendental from the menu to the right of the calculator keypad, because the logarithm is a transcendental (non-algebraic) function. In the list that appears in the rightmost menu, select base 10 logarithm (i.e. log10). In the formula window you should see log10( ). Now you need to supply the name of the variable you wish to take the logarithm of, which is DDT in this case, by selecting it from the list of variables to the left of the calculator keypad. Finally, click Apply and close the calculator window. The new column you created should now contain the base 10 logarithm of the DDT concentrations. The histogram and boxplot for the log base 10 scale DDT readings are shown on the following page. We can clearly see that approximate normality has been achieved through transformation of the DDT levels to the log base 10 scale.

4.4 ~ Types of Summary Statistics

  • Measures of Central Tendency, Typical, or “Average” Value
  • Measures of Spread/Variability
  • Measures of Location/Relative Standing

4.5 ~ Measures of Central Tendency (mean, median, and mode)

Notation for Observations or Data

$x_1, x_2, \ldots, x_n$

where $x_i$ = ith observed value of the variable x and n = sample size

Mean

Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Population Mean: $\mu$

Example:

Median

Middle value when observations are ranked from smallest to largest.

Sample Median (Med)

Population Median (M)

Example:

Mode

Most frequently observed value or for data with no or few repeated values we can think of the mode as being the midpoint of the modal class in a histogram.
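As a rough illustration of how the three measures can differ, the Python sketch below computes each one with the standard `statistics` module. The lengths are hypothetical values chosen to include one unusually short fish; they are not the study data.

```python
import statistics

# Hypothetical fish lengths in cm (NOT the study data); 30.0 is a low outlier.
lengths = [30.0, 41.5, 42.0, 42.0, 43.5, 44.0, 45.0, 47.5]

mean_length = statistics.mean(lengths)      # sum of the values divided by n
median_length = statistics.median(lengths)  # middle value of the ranked data
mode_length = statistics.mode(lengths)      # most frequently observed value

print("mean:", mean_length)
print("median:", median_length)
print("mode:", mode_length)
```

With a low outlier present, the mean is pulled below the median, which is one reason the median is often preferred as the "typical" value for skewed data.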

4.6 ~ Measures of Variability (range, variance/standard deviation, CV, and interquartile range)

Range

Range = Maximum Value – Minimum Value

Example:

Variance and Standard Deviation

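Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Population Variance: $\sigma^2$

Sample Standard Deviation: $s = \sqrt{s^2}$

Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$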
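The range, variance, and standard deviation can be checked by hand or with a few lines of code. The following Python sketch uses hypothetical weights (not the study data) and the standard `statistics` module, where `variance` and `stdev` use the n - 1 divisor of the sample formulas and `pvariance` uses the divisor n.

```python
import math
import statistics

# Hypothetical fish weights in grams (NOT the study data).
weights = [600.0, 850.0, 900.0, 1000.0, 1050.0, 1100.0, 1500.0]

data_range = max(weights) - min(weights)          # maximum minus minimum
sample_variance = statistics.variance(weights)    # divides by n - 1
sample_sd = statistics.stdev(weights)             # square root of sample variance
population_variance = statistics.pvariance(weights)  # divides by n

print("range:", data_range)
print("s^2:", round(sample_variance, 2))
print("s:", round(sample_sd, 2))
print("variance with divisor n:", population_variance)
```

Note that the standard deviation is in the original units (grams here), while the variance is in squared units.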

Example:

Chebyshev’s Theorem and the Empirical Rule

These are used to determine the percentage of observations that lie within certain intervals centered about the mean. The intervals have the form:

mean $\pm$ k $\times$ standard deviation, i.e. $\bar{x} \pm k s$

where k is a positive integer. As an example consider the gestational age of infants at the time of birth.

Chebyshev’s theorem applies to any distribution, normal or not, while the empirical rule applies only to distributions which are approximately normal.

Interval Chebyshev’s Thm Empirical Rule
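The two rules can be compared numerically. The Python sketch below is an illustration, not part of the original notes: `chebyshev_lower_bound` and `normal_proportion` are names invented for this example. Chebyshev's theorem guarantees at least 1 - 1/k^2 of the observations within k standard deviations of the mean, while the empirical rule percentages (roughly 68%, 95%, 99.7%) come from the normal distribution.

```python
from statistics import NormalDist


def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations, for ANY distribution."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)


def normal_proportion(k):
    """Proportion within k standard deviations for an exactly normal distribution."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(
        f"mean +/- {k} sd: "
        f"Chebyshev at least {100 * chebyshev_lower_bound(k):.1f}%, "
        f"normal about {100 * normal_proportion(k):.1f}%"
    )
```

Chebyshev's bound is much weaker because it must hold for every possible distribution shape; for k = 1 it guarantees nothing at all.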

In 1949, a divorce case was heard where the husband filed for divorce on the grounds of his wife’s adultery. The only evidence he had was the fact she gave birth to a child 50 weeks (350 days) after he had gone abroad on military service. The judge hearing the case agreed that though it was improbable a woman would carry a baby 350 days, it was scientifically possible and the child could have been his. Thus the judge did not grant him a divorce. What do these rules say about the likelihood of a gestational age of 350 days?

Coefficient of Variation (CV)

Measures spread relative to the size of the mean.
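The CV is commonly computed as the standard deviation divided by the mean, expressed as a percentage. The Python sketch below assumes that definition; the function name `coefficient_of_variation` and both data sets are hypothetical and are not taken from the study.

```python
import statistics


def coefficient_of_variation(values):
    """Sample CV as a percentage: 100 * (standard deviation / mean)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return 100 * statistics.stdev(values) / mean


# Hypothetical measurements (NOT the study data).
lengths_in = [15.0, 16.5, 17.0, 17.5, 19.0]           # inches
weights_g = [600.0, 850.0, 1000.0, 1150.0, 1400.0]    # grams

cv_length = coefficient_of_variation(lengths_in)
cv_weight = coefficient_of_variation(weights_g)

print("CV of length:", round(cv_length, 1), "%")
print("CV of weight:", round(cv_weight, 1), "%")
```

Because the CV is unit-free, it allows a comparison of spread between variables measured in different units, which a direct comparison of standard deviations does not.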

Example: Which has more variation, length (in.) or weight (g)?

4.7 ~ Measures of Location/Relative Standing

(Percentile/Quantiles and z-scores/Standardized Variables)

Percentiles/Quantiles

Quartiles

Interquartile Range (IQR) (another measure of variability)

Outlier Boxplots

Any observations lying more than $1.5 \times IQR$ below $Q_1$ or more than $1.5 \times IQR$ above $Q_3$ are classified as outliers.
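The outlier rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: `boxplot_outliers` is a name invented here, the data are hypothetical, and `statistics.quantiles` uses its default quartile convention, which may not match the quartiles JMP reports exactly since software packages differ in how they interpolate.

```python
import statistics


def boxplot_outliers(values):
    """Return Q1, Q3, the IQR, and the values beyond the 1.5 * IQR fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q3, iqr, outliers


# Hypothetical fish lengths in cm (NOT the study data).
lengths = [25.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0]

q1, q3, iqr, outliers = boxplot_outliers(lengths)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("flagged as outliers:", outliers)
```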

Standardized Variables (z-scores)

The z-score for an observation is $z = \frac{x - \bar{x}}{s}$ (sample) or $z = \frac{x - \mu}{\sigma}$ (population).

It tells us…

Example:

Which is more extreme, a catfish 24 inches in length or a smallmouth buffalo 13.5 inches in length?
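A comparison of this kind can be sketched in Python as follows. The species means and standard deviations below are placeholders invented for illustration (the actual values must come from the JMP summary statistics), so the result printed here says nothing about the real fish. The helper names `z_score` and `standardize` are also hypothetical.

```python
import statistics


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


def standardize(values):
    """Convert a whole column of values to sample z-scores."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# PLACEHOLDER summary statistics in inches (NOT estimates from the study).
catfish_mean, catfish_sd = 18.0, 3.0
buffalo_mean, buffalo_sd = 17.0, 1.5

z_cat = z_score(24.0, catfish_mean, catfish_sd)
z_buf = z_score(13.5, buffalo_mean, buffalo_sd)

# The observation with the larger |z| is more extreme relative to its own species.
print("catfish z:", round(z_cat, 2))
print("smallmouth buffalo z:", round(z_buf, 2))
```

Standardizing an entire column with `standardize` yields values with mean 0 and standard deviation 1, which is what a histogram of standardized lengths displays.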

Standardizing Variables in JMP

Histogram of standardized lengths.

4.8 ~ Summary Statistics - Measures of Central Tendency, Variability and Location in JMP
When you examine the distribution of a numeric variable in JMP (Analyze > Distribution) you will automatically obtain basic summary statistics. The summary statistics for length, weight, and DDT level for the fish sampled as part of the Tennessee River study are shown below.

To obtain the variance and coefficient of variation you also need to select More Moments from the Display Options pull-out menu.


4.9 ~ Comparative Displays
In this study we could compare the DDT levels of the different fish species and also compare DDT levels of fish by location. We first consider the potential difference in the DDT levels in catfish found at different river locations by using comparative boxplots. Because of the profound right skewness in the DDT levels, we will use the DDT levels transformed to the logarithmic scale. To obtain a basic comparative display in JMP, select Fit Y by X from the Analyze menu and put Location in the X, Factor box and log(DDT) in the Y, Response box. The resulting display will show the log(DDT) levels plotted versus the location number. To add boxplots or other items to this plot, use the Display Options menu located within the main pull-down menu.

The display below shows comparative boxplots for log(DDT) level across location with the X-axis proportional option turned off.

Here we can clearly see that the fish from Flint Creek (309 miles) & Tennessee River (320 miles) have the highest DDT levels and fish from Tenn. River (285 miles) and Tenn. River (345 miles) appear to have the lowest. It is important to note that the latter locations are the only locations where largemouth bass were sampled.

Sample Percentiles/Quantiles by Location

To convert these summary statistics back to the original scale use the following: value in original scale = $10^{\text{value in log}_{10}\text{ scale}}$

e.g.

We can construct a similar display for comparing the log DDT measurements across species by placing Species Name instead of location in the X box.

To obtain summary statistics for the log(DDT) levels within each species type select Quantiles and Means and Std Dev from the Oneway Analysis pull-down menu. The results are shown on the following page.
Summary Statistics for log10(DDT) by Species

How do different species compare in terms of summary statistics?

Catfish clearly have the highest mean and median DDT levels in the log scale while largemouth bass have the smallest. Catfish have the smallest amount of variation, as seen by comparing the standard deviations or the coefficients of variation.
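A by-group summary like the one JMP produces can be sketched in Python. The records below are hypothetical log10(DDT) values made up for illustration, not the study data, and `summarize_by_group` is a name invented for this example.

```python
import statistics
from collections import defaultdict

# Hypothetical (species, log10 DDT) records (NOT the study data).
records = [
    ("Catfish", 1.10), ("Catfish", 1.25), ("Catfish", 0.95), ("Catfish", 1.40),
    ("Smallmouth Buffalo", 0.40), ("Smallmouth Buffalo", 0.95),
    ("Smallmouth Buffalo", 0.10), ("Smallmouth Buffalo", 1.30),
    ("Largemouth Bass", -0.50), ("Largemouth Bass", -0.10),
    ("Largemouth Bass", 0.20),
]


def summarize_by_group(rows):
    """Group values by label and compute n, mean, median, and standard deviation."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {
        label: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "sd": statistics.stdev(values),
        }
        for label, values in groups.items()
    }


summary = summarize_by_group(records)
for species, stats in summary.items():
    print(species, {name: round(value, 3) for name, value in stats.items()})
```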

To convert these summary statistics back to the original scale use the following: value in original scale = $10^{\text{value in log}_{10}\text{ scale}}$

For example, when converting the median DDT level found in catfish in the log10 scale back to the original scale we have
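The back-transformation can be sketched in Python as follows. The concentrations are hypothetical (not the study data) and `from_log10` is a name invented here. With an odd number of observations, the back-transformed median of the logs matches the median in the original scale, whereas the back-transformed mean of the logs is the geometric mean, which is generally smaller than the ordinary mean.

```python
import math
import statistics


def from_log10(value):
    """Convert a value on the log10 scale back to the original scale."""
    return 10 ** value


# Hypothetical DDT concentrations in ppm (NOT the study data).
ddt = [2.0, 5.0, 10.0, 20.0, 50.0]
log_ddt = [math.log10(v) for v in ddt]

median_back = from_log10(statistics.median(log_ddt))  # median on original scale
mean_back = from_log10(statistics.mean(log_ddt))      # geometric mean, not the mean

print("back-transformed median:", median_back)
print("back-transformed mean (geometric mean):", round(mean_back, 3))
print("ordinary mean on original scale:", statistics.mean(ddt))
```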

CDF Plots

A CDF plot shows the estimated probability or chance that we observe a value less than or equal to a given value. The CDF plot for the fish lengths is shown below.

For example:

We estimate that the chance a randomly selected fish is less than 40 cm is ______

We estimate that the chance a randomly selected fish is less than 50 cm is ______
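The values read from a CDF plot are simply proportions of the sample at or below a given value. The Python sketch below shows the idea with hypothetical lengths (not the study data), so its results are not the answers to the blanks above; `ecdf` is a name invented for this example.

```python
def ecdf(values, x):
    """Empirical CDF: proportion of observations less than or equal to x."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(1 for v in values if v <= x) / len(values)


# Hypothetical fish lengths in cm (NOT the study data).
lengths = [30.0, 36.0, 39.5, 41.0, 42.0, 43.0, 44.5, 46.0, 48.0, 51.0]

p_at_most_40 = ecdf(lengths, 40.0)
p_at_most_50 = ecdf(lengths, 50.0)
p_above_50 = 1 - p_at_most_50  # chance of exceeding a value is 1 minus the CDF

print("estimated P(length <= 40 cm):", p_at_most_40)
print("estimated P(length <= 50 cm):", p_at_most_50)
print("estimated P(length > 50 cm):", round(p_above_50, 2))
```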

The plot below gives the CDF plots for the DDT levels found in each of the fish species in this study. To obtain these, select CDF Plots from the Oneway Analysis... pull-down menu. We can clearly see that we are much more likely to find a catfish with a high DDT level, e.g. there is an approximate 50% chance that we sample a catfish with a log10(DDT) level exceeding 1, which is 10 ppm in the original scale. This same chance for smallmouth buffalo is less than 25% and is estimated to be 0% for largemouth bass.
