Demonstration of the Java version of the Pinkham-Pearson index for the comparison of community structure
By
Pinkham, C. F. A.,* J. Gareth Pearson,** Brian P. Reid,+ and Victor T. Chevalier++
Abstract
The Pinkham-Pearson index of similarity has been evaluated by EPA as one of the more powerful tools for comparing community structure in its rapid bioassessment protocol. However, its use has been limited because the program that ran it, BioSim1, was only available in DOS format. A user-friendly beta version of BioSim2 is now available in a Java format that can run on Windows, Mac OS, Linux, or any other platform with Java v1.4 or higher. Its use and power will be demonstrated using real data. If you have data you'd like to try, bring it in xls, csv, or another common spreadsheet format. Copies of the beta version of BioSim2 will be available.
Current Address:
*Biology Department, Norwich University, 158 Harmon Drive, Northfield, VT 05663
**USEPA, PO Box 93478, Las Vegas, NV 89193-3478
+DMS Computing, Dartmouth Medical School, 1 Rope Ferry Road, Hanover, New Hampshire 03755
++431 Isom Road, Suite 125, San Antonio, TX 78216
Introduction
The index of similarity, B (Pinkham and Pearson, 1976), was proposed as a means for determining the impact of pollution on communities. Pinkham and Pearson showed that the index overcame many of the shortcomings inherent in other indexes used for the same purpose and was more versatile. Its use was coded in a DOS program (BioSim) (Pinkham et al., 1975).
Since its publication, it has been widely used for diverse investigations. In 1989, Plafkin et al. included it in EPA’s rapid bioassessment protocols for use in streams and rivers. In 1990, it was identified by EPA as one of six commonly used community similarity indexes in a manual describing guidelines and standardized procedures for using benthic macroinvertebrates to evaluate the biological integrity of surface waters (Klemm et al., 1990). At least one state, Vermont, has adopted B as a legal requirement for assessing surface water quality (Vermont Department of Environmental Conservation, 1990). Barbour et al. (1992), in a systematic comparison of the metrics proposed in EPA's rapid bioassessment protocol (Plafkin et al., 1989), concluded that B "may be the most appropriate metric to serve as a measure of community similarity."
In almost all published cases of its use, however, it was not being used to its full potential. Recognizing this, Pearson and Pinkham (1992) published a strategy for using it in an improved DOS version (BioSim1) (Gonzales et al., 1993). However, this strategy also failed to encourage widespread use of its full capabilities. It soon became apparent that the major reason for this shortcoming was the DOS platform of BioSim1. This paper introduces the beta version of the Java program (Reid et al., in prep) that finally overcomes that shortcoming.
Method
BioSim2 was written with an entirely new appearance and a simpler approach while retaining most of the features discussed in Pearson and Pinkham (1992). A major change was to drop the agglomerative clustering method for forming dendrograms used in former versions of BioSim (Bonham-Carter, 1967) in favor of the simpler average link method (Pankhurst, 1991). This method searches through each possible pair of unlinked parameters and calculates an average B-value for the pair [see Pearson and Pinkham (1992) for a definition of these terms]. The average B-value is determined in one of three situations. 1) Neither parameter belongs to a cluster that has already been formed. This normally happens toward the beginning of the process. In this case, the average is their single B-value. 2) One of the parameters is already part of a cluster. All B-values between the unlinked parameter and every member of that cluster are averaged. 3) Both parameters belong to existing clusters. All B-values between the two clusters are averaged. After the complete set of unlinked pairs has been searched, the pair with the highest average B-value is linked at that average value.
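The linking procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the average link method, not the BioSim2 source code: the function name `average_link`, the tuple representation of clusters, and the `frozenset` keys for B-values are all assumptions made for the example. Treating every unlinked parameter as a cluster of one covers the three situations with a single averaging step.

```python
from itertools import combinations


def average_link(names, b):
    """Link parameters by the average link method.

    names: list of parameter labels (sites or taxa).
    b:     dict mapping frozenset({p, q}) to the B-value of that pair
           (hypothetical input format chosen for this sketch).
    Returns a list of (cluster_a, cluster_b, average_B) joins in the
    order in which they were formed.
    """
    # Every parameter starts as its own single-member cluster.
    clusters = [(name,) for name in names]
    joins = []
    while len(clusters) > 1:
        best = None
        for ca, cb in combinations(clusters, 2):
            # Average all B-values between members of the two clusters.
            # For two single parameters this is just their one B-value.
            values = [b[frozenset((p, q))] for p in ca for q in cb]
            avg = sum(values) / len(values)
            if best is None or avg > best[2]:
                best = (ca, cb, avg)
        ca, cb, avg = best
        # Link the pair with the highest average B at that average value.
        clusters.remove(ca)
        clusters.remove(cb)
        clusters.append(ca + cb)
        joins.append((ca, cb, avg))
    return joins
```

With three hypothetical parameters whose B-values are A-B = 0.9, A-C = 0.3 and B-C = 0.5, the sketch first links A and B at 0.9 and then links C to that cluster at the average of 0.3 and 0.5.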
In addition to the above, the authors decided 1) to reduce the presentation of the program to a single screen that would sequentially display the original data and then the results; 2) to make the conditions of the original data matrix flexible enough that most presently used spreadsheet formats would be acceptable; 3) to provide a plot of the actual B-values used to calculate each average B-value found on the dendrogram, to visualize how well each joining branch of the dendrogram represents the actual distribution of B-values that formed that joining branch; 4) to reorient the dendrogram 180° so that the plot resulting from 3) could be easily compared with the dendrogram; 5) to eliminate the matrix of cophenetic correlation coefficients (Kaesler, 1970) in favor of the simpler and more valuable cophenetic correlation coefficient for the entire dendrogram; and 6) to eliminate many of the choices that needed to be made in BioSim1 by making most options automatic in BioSim2.
Results/Demonstration
The best way to provide a demonstration is to walk through a sample study.
Figure 1 shows the compressed data matrix (Pearson and Pinkham, 1992) for a study conducted in Alaska by one of us (Pinkham, in prep) in 2002. Note that this is an Excel spreadsheet. It was merely copied and pasted into the data field of BioSim2. Also note that it is necessary to have headings for all columns, including column 1.
Figure 1. The compressed data matrix used for this demonstration.
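The paste-in step and the heading requirement can be illustrated with a short Python sketch. It is not how BioSim2 itself reads its data field; the function name `parse_pasted_matrix` is hypothetical, and the sketch assumes that cells copied from a spreadsheet arrive as tab-separated text with one row per line.

```python
import csv
import io


def parse_pasted_matrix(text, delimiter="\t"):
    """Split pasted spreadsheet text into headings, row labels, and counts.

    Assumes (for this sketch only) that the first line holds a heading
    for every column, including column 1, and that every later line
    holds a row label followed by numeric abundances.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        raise ValueError("no data found")
    header, body = rows[0], rows[1:]
    # Every column needs a heading, including the label column.
    if any(not h.strip() for h in header):
        raise ValueError("every column, including column 1, needs a heading")
    for r in body:
        if len(r) != len(header):
            raise ValueError("row %r does not match the heading row" % r[0])
    row_labels = [r[0] for r in body]
    col_labels = header[1:]
    data = [[float(v) for v in r[1:]] for r in body]
    return header[0], row_labels, col_labels, data
```

A matrix whose first heading cell is blank is rejected with an error in this sketch, which mirrors the requirement that column 1 also carry a heading.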
Figure 2 shows the screen from BioSim2 after entering the data. This screen is obtained by opening the program, typing in the titles for the overall analysis, rows and columns, selecting which version of B to use, determining what to do with low denominators, and pasting the data in the data field. These data are then processed by selecting “Process Data” from the “Actions” option on the menu bar. The entire process should only take seconds.
Note the scroll bar at the bottom of the data field, indicating the compressed data matrix is wider than the screen display. A similar scroll bar appears at the side if the compressed data matrix is longer than the screen display. The only differences between the opening screen and the screen after the data are processed are that the message “Data successfully processed.” replaces “Welcome to BioSim2.beta.30!” and the color of the background bar changes from gray to green. Notice the tabs at the bottom of the screen. These will be sequentially selected below to observe the output of the program.
Figure 2. Appearance of the screen after entering the data
Figure 3 shows the resulting row dendrogram. A new feature is the addition of the average B-value at each joining branch. Note that the conditions selected in the first screen are printed at the top and the cophenetic correlation coefficient for the entire dendrogram is also given (rcs = 0.892).
Figure 3. The dendrogram of sites
Figure 4 shows the cophenetic correlation plot for this dendrogram. Note that it is not the next item on the tabs, but it is appropriate to show it here in line with the dendrogram that it represents. Each joining branch on the dendrogram has its appropriate location on the cophenetic correlation plot, and the two are lined up in this presentation so the scatter associated with each joining branch can readily be assessed. All points are distributed close to the line, so this dendrogram is a very good representation of the relationships among the sites.
Figure 4. Cophenetic correlation plot for the dendrogram of sites
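A single cophenetic correlation coefficient for a whole dendrogram can be computed as the Pearson correlation between each actual B-value and the average B-value of the joining branch at which that pair was linked. The Python sketch below follows that general definition; it is not taken from BioSim2, and the function name and input formats (a list of joins and a dict of B-values keyed by `frozenset` pairs) are assumptions made for the example.

```python
from math import sqrt


def cophenetic_correlation(joins, b):
    """Correlate actual B-values with the branch averages that replaced them.

    joins: list of (cluster_a, cluster_b, average_B) tuples, one per
           joining branch (hypothetical format for this sketch).
    b:     dict mapping frozenset({p, q}) to the actual B-value.
    """
    actual, cophenetic = [], []
    for ca, cb, avg in joins:
        # Every pair that spans the two clusters was linked at this branch.
        for p in ca:
            for q in cb:
                actual.append(b[frozenset((p, q))])
                cophenetic.append(avg)
    n = len(actual)
    mean_a = sum(actual) / n
    mean_c = sum(cophenetic) / n
    cov = sum((a - mean_a) * (c - mean_c) for a, c in zip(actual, cophenetic))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_c = sum((c - mean_c) ** 2 for c in cophenetic)
    if var_a == 0 or var_c == 0:
        raise ValueError("correlation is undefined when either set of values is constant")
    return cov / sqrt(var_a * var_c)
```

The (actual, cophenetic) pairs collected in the loop are also the points that a cophenetic correlation plot would display, so the scatter about each joining branch corresponds to the spread of actual B-values sharing one branch average.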
Figures 5 and 6 show the resulting column dendrogram and cophenetic correlation plot, respectively. The resulting scatter is still fairly close to the line, with the scatter about the joining branch at 0.278 being the greatest and thus suggesting that the most caution should be applied to interpreting conditions revealed by this joining branch.
Figure 5. The dendrogram of taxa
Figure 6. Cophenetic correlation plot for the dendrogram of taxa
Figure 7 shows the matrix of B’s for both the rows and columns. Notice the scroll bars that appear automatically to accommodate large matrices. Although not usually examined, these matrices are good to have to find B values calculated between all pairs of parameters if they are needed to check how tightly one parameter links to another.
Figure 7. Row and Column Matrix of B’s.
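For readers who want to see how a matrix of B's is built, the following Python sketch computes the unweighted index for every pair of samples. It assumes the form of the index as it is usually stated, the mean over taxa of the smaller abundance divided by the larger; the treatment of taxa absent from both samples is exposed as an option because it is an assumption of this sketch, and the function names are hypothetical. The options BioSim2 offers for the version of B and for low denominators are not reproduced here.

```python
from itertools import combinations


def pinkham_pearson_b(x, y, count_mutual_absence=True):
    """Unweighted B between two equal-length abundance lists."""
    if len(x) != len(y):
        raise ValueError("both samples must list the same taxa")
    ratios = []
    for a, c in zip(x, y):
        if a == 0 and c == 0:
            # Assumption: a taxon absent from both samples either counts
            # as a perfect match (ratio 1) or is left out of the mean.
            if count_mutual_absence:
                ratios.append(1.0)
            continue
        ratios.append(min(a, c) / max(a, c))
    return sum(ratios) / len(ratios) if ratios else 0.0


def b_matrix(samples, count_mutual_absence=True):
    """Return {frozenset({p, q}): B} for every pair of named samples.

    samples: dict mapping a sample name to its list of abundances.
    """
    return {
        frozenset((p, q)): pinkham_pearson_b(samples[p], samples[q], count_mutual_absence)
        for p, q in combinations(samples, 2)
    }
```

For two hypothetical samples with abundances [10, 0, 5] and [5, 0, 10], the ratios are 0.5, 1 and 0.5, so B is 2/3 when the mutual absence is counted and 0.5 when it is skipped.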
Figure 8 shows the final, and most desirable, output: the original compressed data matrix rearranged as a two-way table (Pearson and Pinkham, 1992) in the order indicated by both dendrograms (the double dendrogram ordered original data matrix). Note that it is provided both in the orientation in which the original data were entered and in transposed form. Often one orientation will fit into a document better than the other.
Figure 8. Original (compressed) data matrix rearranged in double-dendrogram order.
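The rearrangement itself is a simple reindexing once the two dendrograms have fixed the order of the rows and columns. The Python sketch below shows the idea under the assumption that each dendrogram has already been reduced to an ordered list of labels; the function names and argument layout are hypothetical and are not drawn from BioSim2.

```python
def reorder(matrix, row_labels, col_labels, row_order, col_order):
    """Return the data matrix with rows and columns in dendrogram order.

    matrix:     list of rows, as originally entered.
    row_labels: labels of the rows of `matrix`, in their original order.
    col_labels: labels of the columns of `matrix`, in their original order.
    row_order:  row labels in the order given by the row dendrogram.
    col_order:  column labels in the order given by the column dendrogram.
    """
    row_index = {label: i for i, label in enumerate(row_labels)}
    col_index = {label: j for j, label in enumerate(col_labels)}
    return [
        [matrix[row_index[r]][col_index[c]] for c in col_order]
        for r in row_order
    ]


def transpose(matrix):
    """Swap rows and columns, giving the transposed presentation."""
    return [list(column) for column in zip(*matrix)]
```

Calling `transpose` on the reordered matrix gives the second orientation, with the row and column label lists exchanged accordingly.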
Finally, Figure 9 shows a transposed, double dendrogram ordered, original data matrix as it is printed out by BioSim2 with embedded, colored annotations and annotations to the side and bottom that must be done by hand. The former is obtained by selecting the “Print” option from the menu bar and selecting “Transposed Reordered Data.” The data displayed on all these screens can similarly be printed out as separate pages.
Note in this figure that the red vertical line separates the two divisions indicated by the first joining branch in the site dendrogram at an average B of 0.271. The division to the right of this line is further separated by a green vertical line into the two divisions indicated by the second joining branch at an average B of 0.430. Switching to the divisions indicated by the taxa dendrogram, note that the red horizontal line separates the two major divisions indicated by the first joining branch in the taxa dendrogram at an average B of 0.169. The second joining branch further separates the division above the 0.169 line into two divisions at an average B of 0.233. The third joining branch subdivides the division below the red line into two divisions at an average B of 0.268. Finally, the fourth joining branch subdivides the division above the top green line into two divisions at an average B of 0.278. Thus there are three divisions indicated by the two major joining branches of the site dendrogram and five divisions indicated by the four major joining branches of the taxa dendrogram, forming 15 rectangles of sites with taxa having similar abundances in the community structure. If we classify 0 as Absent, 1-10 as Rare, 11-50 as Occasional, 51-150 as Common, and greater than 150 as Dominant (indicated by the pink letters in each rectangle), it becomes clear that two contrasts are the major reasons the CC sites separate from the PC and CPC sites. First, DT-Dicr, DCD-Paga and DCO-Lapp are absent from the CC sites but mostly rare in the PC and CPC sites. Second, DS-Pros, DS-Meta and EB-Baet are dominant in the CC sites but mostly occasional in the PC and CPC sites. Now that different taxa have been identified that have different places in the community structures of these sites, the underlying biological explanation for the differences can be pursued.
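The abundance classes used to annotate the rectangles can be written as a small lookup. The class boundaries in this Python sketch are the ones given in the text; labeling a rectangle by its most frequent class is an assumption made here to mirror phrases such as "mostly rare," and both function names are hypothetical. In the demonstration the annotations were added by hand.

```python
from collections import Counter


def abundance_class(count):
    """Map a count to the abundance class used in the annotated table."""
    if count < 0:
        raise ValueError("abundance cannot be negative")
    if count == 0:
        return "Absent"
    if count <= 10:
        return "Rare"
    if count <= 50:
        return "Occasional"
    if count <= 150:
        return "Common"
    return "Dominant"


def modal_class(counts):
    """Return the most frequent abundance class among a rectangle's cells.

    Assumption for this sketch: a rectangle is summarized by the class
    that occurs most often among its cells.
    """
    classes = Counter(abundance_class(c) for c in counts)
    return classes.most_common(1)[0][0]
```

Applied to the cells of one rectangle of the reordered matrix, `modal_class` returns a single label that can be compared across the site divisions.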
In addition to printing each of these outputs directly, BioSim2 generates and saves a single HTML page containing all tables and images of the dendrograms and plots. Tables are saved individually in CSV format (comma-separated values, ASCII text) appropriate for spreadsheets, and the images are saved as JPG files.
Figure 9. Printout of the original (compressed) data matrix rearranged in double-dendrogram order, transposed, and annotated by being broken into subdivisions of sites and taxa having similar abundances in community structure.
Conclusion
BioSim2 is a major improvement over prior versions. Most importantly, it is compatible with most available platforms.
In addition, it offers a more user-friendly interface and eliminates the need to make many of the decisions required before processing.
This beta version is available on Pinkham’s web site (see footnote a), and a user’s manual (Pearson et al., in prep) is being written. Users are requested to forward their comments and problems to either Pinkham or Pearson at the addresses given so that the usefulness of BioSim2 can continue to grow.
______
a http://www2.norwich.edu/pinkhamc/
Literature Cited
Barbour, M.T., J.L. Plafkin, B.P. Bradley, C.G. Graves, and R.W. Wisseman. 1992. Evaluation of EPA’s rapid bioassessment benthic metrics: metric redundancy and variability among reference stream sites. Environmental Toxicology and Chemistry, 11(4): 437-449.
Bonham-Carter, G. F. 1967. Fortran IV program for Q-mode cluster analysis of non-quantitative data using IBM 7090/7094 computers. Kans. Geol. Surv., Computer Contribution No. 17.
Gonzales, D.A., J.G. Pearson and C.F.A. Pinkham. 1993. User’s manual for BIOSIM1, beta version 1.0. EPA Environmental Monitoring Systems Laboratory, Las Vegas, NV.
Kaesler, R.L. 1970. The Cophenetic Correlation Coefficient in Paleoecology. Geological Society of America Bulletin. Lawrence, KS. pp 1261-1266.
Klemm, D.J., P.A. Lewis and J.M. Lazorchak. 1990. Macroinvertebrate Field and Laboratory Methods for Evaluating the Biological Integrity of Surface Waters. Aquatic Biology Branch, Quality Assurance Research Division, Environmental Monitoring Systems Laboratory. Cincinnati, OH. EPA-600/0-90-000, U.S.EPA, Washington, D.C.
Pankhurst, Richard J. 1991. Practical Taxonomic Computing. Cambridge University Press, Cambridge, 201 pp.
Pearson, J.G., and C.F.A. Pinkham. 1992. Strategy for data analysis in environmental surveys emphasizing the index of biotic similarity and BIOSIM1. Water Environ. Res., 64:901-909.
Pearson, J.G., B.P. Reid, C.F.A. Pinkham and V.T. Chevalier. In prep. User’s Manual for BioSim2, a Java-based computer program to calculate the Pinkham-Pearson index of similarity.
Plafkin, J.L., M.T. Barbour, K.D. Porter, S.K. Gross, and R.M. Hughes. 1989. Rapid Bioassessment Protocols for Use in Streams and Rivers: Benthic Macroinvertebrates and Fish. US EPA, Washington, DC, EPA 444/4-89-001.
Pinkham, C.F.A., and J.G. Pearson. 1976. Applications of a new coefficient of similarity to pollution surveys. J. Water Pollut. Control Fed., 48:717.
Pinkham, C.F.A., J.G. Pearson, W.L. Clontz and A.E. Asaki. 1975. A Computer Program for Calculations of Measures of Biotic Similarity Between Samples and the Plotting of the Relationship Between These Measures. EATR EB‑TR‑75013, Mar - Sep 74, 41pp.
Pinkham, C.F.A. In prep. Studies of the differences in community structure of stream macroinvertebrates in a Long-Term Ecological Research Watershed in Alaska, Part 1, differences around a confluence of two streams.