User Guide to Statistical Analyses

Most of the statistical analyses are conducted using SAS v.8.02 (SAS Institute, Inc., Cary, NC 27513) programs. Analyses include:

(1) The Kolmogorov-Smirnov two-sample test, to determine if there are differences between treatments in the distributions of occurrence of clones (1, 2, 3, etc., times) within fingerprint groups. Use our Excel spreadsheet for this test.

(2) One-way ANOVAs on the total number of clones found per treatment, or on fingerprint richness, diversity, or evenness in each treatment.

(3) Cluster analysis, to group treatments into similar groups based on the pattern of clones per fingerprint cluster.

(4) Stepwise Discriminant analysis, to identify fingerprint clusters in which the distribution of clones per treatment varies.

(5) Stepwise Regression analysis, to identify fingerprint clusters in which the distribution of clones per treatment varies with some measurement that varies between treatments, for example, the ability of soils in different treatments to suppress a plant parasitic nematode.

(1) Kolmogorov-Smirnov two-sample test: We use the Kolmogorov-Smirnov two-sample test to see if the distributions of occurrence of clones (1, 2, 3, etc., times in a group) are different between treatments. The first step in the analysis is to compile the frequency distributions for the clones in the treatments to be tested:

-Open Microsoft Excel.

-Open the Taxonomic Table data that you saved when you computed it using GCPAT.

-Now you need to compute the frequency with which clones in each treatment occur in groups ranging from a size of 1 to a size of 10 or more. Your treatment data will be organized into columns, with one column per treatment. Use the Excel FREQUENCY function to obtain the clone frequencies for each treatment. In the example below, assume that columns C and D contain treatment data and that the rows containing this data run from x to y. When entering this formula in Excel you will have to type in actual row numbers instead of x, y, a and k. FREQUENCY is an array function, so in versions of Excel without dynamic arrays, select the whole output range first and confirm the formula with CTRL + SHIFT + ENTER.

Row / Column A / Column B / Column C (treatment 1 column) / Column D (treatment 2 column)
Row a / Frequency / 0 / =FREQUENCY(Cx:Cy,Ba:Bk) / =FREQUENCY(Dx:Dy,Ba:Bk)
Row b / / 1 / /
Row c / / 2 / /
Row d / / 3 / /
Row e / / 4 / /
Row f / / 5 / /
Row g / / 6 / /
Row h / / 7 / /
Row i / / 8 / /
Row j / / 9 / /
Row k / / 10 / /

-When you have more than one column for one treatment (i.e., replicates within a treatment) and want to compare the distributions of each treatment as a whole, compile the frequency distribution for each replicate, then sum these to get the frequency distribution for the whole treatment. These summed frequency distributions can then be compared.
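The compilation step above can also be sketched outside Excel. The Python below is only an illustration, not part of the GCPAT or Excel workflow: the function names and the two replicate columns are hypothetical, and folding every count of 10 or more into the last bin is an assumption made to match the "10 or more" row of the K-S spreadsheet.

```python
from collections import Counter

MAX_BIN = 10  # assumption: counts of 10 or more share the last bin


def frequency_distribution(clone_counts, max_bin=MAX_BIN):
    """Return [f0, f1, ..., f_max_bin], where f_k is the number of
    fingerprint groups holding k clones (last bin: max_bin or more)."""
    tally = Counter(min(count, max_bin) for count in clone_counts)
    return [tally.get(size, 0) for size in range(max_bin + 1)]


def sum_distributions(distributions):
    """Add replicate distributions bin by bin to get the treatment total."""
    return [sum(bins) for bins in zip(*distributions)]


# Hypothetical clones-per-group columns for two replicates of one treatment.
replicate_a = [1, 1, 2, 0, 3, 1, 12, 0, 1]
replicate_b = [0, 1, 1, 1, 2, 0, 10, 5, 1]

freq_a = frequency_distribution(replicate_a)
freq_b = frequency_distribution(replicate_b)
treatment_total = sum_distributions([freq_a, freq_b])

print(freq_a)
print(freq_b)
print(treatment_total)
```

The first element of each list is the frequency for 0 clones, which is dropped before the data are entered into the K-S spreadsheet.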

-Once you have the frequency distributions you’d like to compare, open the KS Test spreadsheet in Excel.

-Enter frequency data (omitting 0 frequency data) for one treatment into the appropriate places in columns B and C:

Enter frequency data here
Frequency of Occurrence / 101a / 101b
1 / 676 / 539
2 / 21 / 10
3 / 4 / 2
4 / 1 / 2
5 / 0 / 0
6 / 0 / 0
7 / 0 / 0
8 / 0 / 0
9 / 1 / 0
10 or more / 0 / 0
Total / 703 / 553

-The maximum difference in the cumulative distributions (D) is then calculated and compared with the K-S critical value (both are calculated automatically and displayed in column H). We reject the null hypothesis (H0), that the two populations have the same distribution, if D is greater than the appropriate K-S critical value.
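The calculation of D can be sketched in a few lines of Python using the example frequencies above. This is a hedged illustration rather than a copy of the spreadsheet: the function name is hypothetical, and the critical value uses the common large-sample approximation 1.36 * sqrt((n1 + n2) / (n1 * n2)) for a significance level of 0.05, which may differ from the value the spreadsheet reports.

```python
import math


def ks_two_sample(freq_1, freq_2, coefficient=1.36):
    """Return (D, approximate critical value) for two frequency tables.

    coefficient=1.36 is the usual large-sample constant for alpha = 0.05;
    it is an assumption here, not a value taken from the spreadsheet.
    """
    n1, n2 = sum(freq_1), sum(freq_2)
    cum_1 = cum_2 = 0
    d = 0.0
    for count_1, count_2 in zip(freq_1, freq_2):
        cum_1 += count_1
        cum_2 += count_2
        # Largest gap between the two cumulative proportions so far.
        d = max(d, abs(cum_1 / n1 - cum_2 / n2))
    critical = coefficient * math.sqrt((n1 + n2) / (n1 * n2))
    return d, critical


# Frequencies for 1, 2, ..., 9 and "10 or more" occurrences (example table).
freq_101a = [676, 21, 4, 1, 0, 0, 0, 0, 1, 0]
freq_101b = [539, 10, 2, 2, 0, 0, 0, 0, 0, 0]

d_stat, d_crit = ks_two_sample(freq_101a, freq_101b)
reject_h0 = d_stat > d_crit
print(f"D = {d_stat:.4f}, critical value = {d_crit:.4f}, reject H0: {reject_h0}")
```

For the example data D is about 0.013, well below the approximate critical value of about 0.077, so under these assumptions H0 would not be rejected.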

(2) One-way ANOVAs: We use this test to see if there are differences between treatments. In order to compare treatments, we need multiple observations for each treatment. When you design your experiment, try to accommodate treatment replications, or you may not be able to compare parameter values for each treatment.

-Use SAS to perform this analysis. Below is a sample SAS program that you can rewrite to accommodate your needs.

options ls=80 ps=55 nocenter nodate;

/* Set up temporary SAS data set called onewaybacteria */
data onewaybacteria;
/* Two variables to be input, treatment and bacteria count */
/* The "$" indicates that treatment is a text variable */
/* baccount is the total number of different bacterial clones found for each individual replicate */
input treatment $ baccount;
title1 'Oneway ANOVA Example';
title2 'Bacteria Count';
/* datalines statement to indicate data is about to begin */
datalines;
101 644
101 502
101 495
zhou 610
zhou 525
zhou 560
mbbb 539
mbbb 513
mbbb 577
mbv 549
mbv 587
mbv 474
;
run;

/* This analysis has balanced data, but proc glm was used in case
   there are unequal replicates on future data */
proc glm;
class treatment;
model baccount = treatment;
/* The MEANS statement with the TUKEY option generates the Tukey pairwise
   comparisons for the cells. This test protects against inflation of
   the type I error rate due to multiple t-tests and uses a constant
   error term for the analysis. */
means treatment / tukey cldiff;
run;

This one-way ANOVA example will test for differences in total number of bacteria clones identified between the four treatments. The Tukey means separation test is then applied to identify pairwise differences between the individual treatments. The output from this analysis is presented below.

Oneway ANOVA Example
Bacteria Count

The GLM Procedure

Dependent Variable: baccount

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        1330.25000      443.41667       0.13    0.9370
Error               8       26472.66667     3309.08333
Corrected Total    11       27802.91667

R-Square    Coeff Var    Root MSE    baccount Mean
0.047846     10.49879    57.52463         547.9167

Source       DF      Type I SS    Mean Square    F Value    Pr > F
treatment     3    1330.250000     443.416667       0.13    0.9370

Source       DF    Type III SS    Mean Square    F Value    Pr > F
treatment     3    1330.250000     443.416667       0.13    0.9370

The GLM Procedure

Tukey's Studentized Range (HSD) Test for baccount

NOTE: This test controls the Type I experimentwise error rate.

Alpha                                     0.05
Error Degrees of Freedom                     8
Error Mean Square                     3309.083
Critical Value of Studentized Range    4.52880
Minimum Significant Difference          150.41

Comparisons significant at the 0.05 level are indicated by ***.

treatment       Difference       Simultaneous 95%
Comparison      Between Means    Confidence Limits
zhou - 101           18.00       -132.41    168.41
zhou - mbbb          22.00       -128.41    172.41
zhou - mbv           28.33       -122.08    178.74
101 - zhou          -18.00       -168.41    132.41
101 - mbbb            4.00       -146.41    154.41
101 - mbv            10.33       -140.08    160.74
mbbb - zhou         -22.00       -172.41    128.41
mbbb - 101           -4.00       -154.41    146.41
mbbb - mbv            6.33       -144.08    156.74
mbv - zhou          -28.33       -178.74    122.08
mbv - 101           -10.33       -160.74    140.08
mbv - mbbb           -6.33       -156.74    144.08

In this example the ANOVA found no significant differences in the total number of bacteria clones between the four treatments (F=0.13, p=0.9370), and no significant differences were found in the pairwise comparisons. Typically the pairwise comparisons would not be used if the overall F-test was not significant.
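The sums of squares and the F value in the ANOVA table can be checked by hand. The Python below is a minimal sketch of the one-way ANOVA arithmetic applied to the example data; the function name is hypothetical, and the p-value is left to SAS because it requires the F distribution.

```python
# Example data from the SAS program: three replicates per treatment.
bacteria_counts = {
    "101": [644, 502, 495],
    "zhou": [610, 525, 560],
    "mbbb": [539, 513, 577],
    "mbv": [549, 587, 474],
}


def one_way_anova(groups):
    """Return (SS model, SS error, DF model, DF error, F) for a one-way layout."""
    values = [value for obs in groups.values() for value in obs]
    grand_mean = sum(values) / len(values)
    # Between-treatment (model) sum of squares.
    ss_model = sum(
        len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
        for obs in groups.values()
    )
    # Within-treatment (error) sum of squares.
    ss_error = sum(
        (value - sum(obs) / len(obs)) ** 2
        for obs in groups.values()
        for value in obs
    )
    df_model = len(groups) - 1
    df_error = len(values) - len(groups)
    f_value = (ss_model / df_model) / (ss_error / df_error)
    return ss_model, ss_error, df_model, df_error, f_value


ss_model, ss_error, df_model, df_error, f_value = one_way_anova(bacteria_counts)
print(f"SS model = {ss_model:.2f} (DF {df_model})")
print(f"SS error = {ss_error:.2f} (DF {df_error})")
print(f"F = {f_value:.2f}")
```

The values agree with the Model and Error rows of the GLM output (1330.25, 26472.67, and F = 0.13).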

(3) Cluster analysis: This method is used to group the treatments into similar groups. A difficulty in identifying similarity between clone library treatments is that the majority of identified clones appear in groups containing a single clone, or only a few clones. For this reason the cluster analysis uses only data from clones appearing in larger groups, but clear guidelines for what should be considered a large enough group have not yet been identified. For now, we consider groups containing 5 or more clones to be large enough.

-First step: Obtain data for fingerprint groups containing 5 or more clones.

-Open Excel, and open the Taxonomic Table file containing data for all of your clones.

-Save the Table file under a different name. Delete all columns but those describing the Group Number (Column A), the Number of Clones per Group (Column B), and the number of clones per treatment in each group (treatment data columns).

-Highlight all the rows and columns containing data for the fingerprint groups.

-Data -> Sort; Sort by: Number of Clones per Group.

-You should get output that sorts the fingerprints into rows based on how many clones there are in each group. Delete the rows that contain data for fingerprints containing 4 or fewer clones. Save your data, and be sure to use a file name that is different from the Taxonomic Table file or you will lose most of your data and have to re-run the Taxonomic Table function in GCPAT to get it back.

-Second step: format the data for use with the SAS cluster analysis program. For this program the data needs to be in the following format:

Group1 / Group2 / Group3 / Group4 / Group5 / Group6 / Unit
9 / 4 / 0 / 0 / 0 / 0 / One-a
9 / 3 / 1 / 0 / 0 / 0 / One-b
9 / 2 / 3 / 1 / 2 / 0 / One-c
0 / 4 / 1 / 0 / 1 / 0 / Two-a
0 / 1 / 3 / 2 / 0 / 0 / Two-b
0 / 4 / 5 / 1 / 2 / 0 / Two-c
2 / 2 / 4 / 1 / 0 / 0 / Three-a
2 / 3 / 3 / 0 / 0 / 0 / Three-b
2 / 3 / 0 / 0 / 0 / 0 / Three-c
1 / 3 / 1 / 0 / 0 / 5 / Four-a
1 / 2 / 0 / 0 / 0 / 5 / Four-b
1 / 7 / 0 / 0 / 0 / 5 / Four-c

-To format your data, first delete the Number of Clones per Group data column (Column B), then highlight all rows and columns containing data.

-Copy (CTRL + C)

-Open a new blank workbook.

-Edit -> Paste Special: click “Transpose values”

-You will get a list of columns that looks like the above, only the “Unit” column is named “Group Number”. Change this to “Unit” and relocate the information to the last column in the file.

-Save the file. Close the file before attempting to run the SAS program that uses it. Note that the SAS code below assumes that the data is saved in Microsoft Excel v. 4.0 format and will look for a data file that is saved in that format, complete with the appropriate extension.
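The filtering and transposing steps above can be summarized in a short Python sketch. It is an illustration only: the column headers, the four example rows, and the function name are hypothetical, and the real work is still done in Excel as described.

```python
MIN_CLONES = 5  # groups with fewer clones than this are dropped

# Hypothetical Taxonomic Table extract: group number, clones per group,
# then one column of clone counts per treatment replicate.
header = ["Group Number", "Clones per Group", "One-a", "One-b", "Two-a"]
rows = [
    ["Group1", 18, 9, 9, 0],
    ["Group2", 11, 4, 3, 4],
    ["Group3", 2, 0, 1, 1],
    ["Group4", 5, 2, 0, 3],
]


def to_cluster_format(header, rows, min_clones=MIN_CLONES):
    """Keep large groups, then transpose so each row is one unit."""
    kept = [row for row in rows if row[1] >= min_clones]
    # First output row: one column per kept group, with "Unit" last.
    table = [[row[0] for row in kept] + ["Unit"]]
    for offset, unit in enumerate(header[2:], start=2):
        table.append([row[offset] for row in kept] + [unit])
    return table


cluster_table = to_cluster_format(header, rows)
for line in cluster_table:
    print(" / ".join(str(cell) for cell in line))
```

Each output row holds the clone counts for one treatment replicate, with the unit label in the last column, which is the layout shown for the SAS cluster analysis.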

An example of SAS code for the cluster analysis is:

PROC IMPORT OUT=WORK.liztest
    DATAFILE="C:\Example.xlw"
    DBMS=EXCEL4 REPLACE;
    GETNAMES=YES;
RUN;

title1 'Cluster Analysis Example';

proc print;
run;

proc cluster method=single std;
id unit;

proc tree;
run;

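When this code is run, SAS prints the imported data, reports the cluster history, and draws a tree diagram of the treatments; that output is not reproduced here.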
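Separately from the SAS output, the idea behind single-linkage clustering on standardized data can be sketched in Python using the example table. This is a rough illustration under stated assumptions, not a reimplementation of proc cluster: the function names are hypothetical, each column is standardized to mean 0 and sample standard deviation 1, and Euclidean distance between units is assumed.

```python
import math
import statistics

units = ["One-a", "One-b", "One-c", "Two-a", "Two-b", "Two-c",
         "Three-a", "Three-b", "Three-c", "Four-a", "Four-b", "Four-c"]
counts = [
    [9, 4, 0, 0, 0, 0], [9, 3, 1, 0, 0, 0], [9, 2, 3, 1, 2, 0],
    [0, 4, 1, 0, 1, 0], [0, 1, 3, 2, 0, 0], [0, 4, 5, 1, 2, 0],
    [2, 2, 4, 1, 0, 0], [2, 3, 3, 0, 0, 0], [2, 3, 0, 0, 0, 0],
    [1, 3, 1, 0, 0, 5], [1, 2, 0, 0, 0, 5], [1, 7, 0, 0, 0, 5],
]


def standardize(rows):
    """Scale each column to mean 0 and sample standard deviation 1.
    A constant column would divide by zero; the example has none."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def single_linkage(points, labels):
    """Repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(math.dist(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((dist,
                       [labels[i] for i in clusters[a]],
                       [labels[i] for i in clusters[b]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


merges = single_linkage(standardize(counts), units)
for dist, left, right in merges:
    print(f"{dist:.3f}  {left} + {right}")
```

Each printed line is one merge, listed from the most similar units to the least similar, which is the information a tree diagram displays graphically.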

(4) Stepwise Discriminant analysis: Given the large number of fingerprint groups in OFRG studies, it would be infeasible to manually pick out groups, or clusters of groups, that demonstrate treatment differences. To help us locate differences between treatments, we use a stepwise discriminant analysis. We need to look at data from groups containing a sufficient number of clones for analysis. The optimal number of clones for this analysis is unknown, but currently we analyze data from fingerprint groups containing 5 or more clones.

-First step: Open the file you made for Cluster analysis, above, or if you have not performed this analysis, obtain the data from groups containing 5 or more clones and transpose it, as described for generating Cluster analysis data files.

-The format of the file will be slightly different in that the treatment and replicate data are in separate columns and come at the beginning of the data:

Treat / Rep / Group1 / Group2 / Group3 / Group4 / Group5 / Group6
One / a / 9 / 4 / 0 / 0 / 0 / 0
One / b / 9 / 3 / 1 / 0 / 0 / 0
One / c / 9 / 2 / 3 / 1 / 2 / 0
Two / a / 0 / 4 / 1 / 0 / 1 / 0
Two / b / 0 / 1 / 3 / 2 / 0 / 0
Two / c / 0 / 4 / 5 / 1 / 2 / 0
Three / a / 2 / 2 / 4 / 1 / 0 / 0
Three / b / 2 / 3 / 3 / 0 / 0 / 0
Three / c / 2 / 3 / 0 / 0 / 0 / 0
Four / a / 1 / 3 / 1 / 0 / 0 / 5
Four / b / 1 / 2 / 0 / 0 / 0 / 5
Four / c / 1 / 7 / 0 / 0 / 0 / 5

-Save the file under a new name.

An example of a stepwise discriminant SAS program is below.

-Note that you must input the names of each group (or, the Group Number for each fingerprint group) in both the input statement and the proc stepdisc var statement, and that each group name must be recognized by SAS as a text string and not a number (for example, Group1 rather than 1).

-To insert data for datalines, you can copy (CTRL + C) and paste (CTRL + V) from Excel spreadsheets.

options ls=80 ps=55 nocenter nodate;

/* Set up temporary SAS data set called stepdisc */
data stepdisc;
/* Variables to be input are Treatment, Replicate, and number of clones per treatment and replicate in each group */
/* The "$" indicates that it is a text variable */
input Treatment $ Replicate $ Group1 Group2 Group3 Group4 Group5 Group6;
title1 'Stepwise Discriminant Analysis';
title2 'Example Data';
/* datalines statement to indicate data is about to begin */
datalines;
One a 9 4 0 0 0 0
One b 9 3 1 0 0 0
One c 9 2 3 1 2 0
Two a 0 4 1 0 1 0
Two b 0 1 3 2 0 0
Two c 0 4 5 1 2 0
Three a 2 2 4 1 0 0
Three b 2 3 3 0 0 0
Three c 2 3 0 0 0 0
Four a 1 3 1 0 0 5
Four b 1 2 0 0 0 5
Four c 1 7 0 0 0 5
;
run;

proc stepdisc data=stepdisc;
class Treatment;
var Group1 Group2 Group3 Group4 Group5 Group6;
run;

An example of the output is below.

The STEPDISC Procedure
Stepwise Selection: Step 1

Statistics for Entry, DF = 3, 8

Variable    R-Square    F Value    Pr > F    Tolerance
Group1        1.0000      Infty    <.0001       1.0000
Group2        0.1169       0.35    0.7885       1.0000
Group3        0.3577       1.48    0.2905       1.0000
Group4        0.3220       1.27    0.3493       1.0000
Group5        0.3253       1.29    0.3437       1.0000
Group6        1.0000      Infty    <.0001       1.0000

-The program may automatically select groups it thinks should be excluded (or which demonstrate differences in clone distribution between treatments), but we recommend looking instead at the data before any groups are excluded (or, “Step 1” data), since sometimes the number of steps the program can take is less than the number of groups that should be excluded.

-In the above example, two groups have a significant p-value, Group1 and Group6. These groups probably represent microorganisms which have different distributions between treatments.
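The Step 1 entry statistics can be understood as one one-way ANOVA per fingerprint group, with treatment as the class variable. The Python below is a hedged sketch of that arithmetic for the example data, not a reimplementation of proc stepdisc; the function name is hypothetical and p-values are left to SAS.

```python
import math

treatments = ["One"] * 3 + ["Two"] * 3 + ["Three"] * 3 + ["Four"] * 3
groups = {
    "Group1": [9, 9, 9, 0, 0, 0, 2, 2, 2, 1, 1, 1],
    "Group2": [4, 3, 2, 4, 1, 4, 2, 3, 3, 3, 2, 7],
    "Group3": [0, 1, 3, 1, 3, 5, 4, 3, 0, 1, 0, 0],
    "Group4": [0, 0, 1, 0, 2, 1, 1, 0, 0, 0, 0, 0],
    "Group5": [0, 0, 2, 1, 0, 2, 0, 0, 0, 0, 0, 0],
    "Group6": [0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 5],
}


def entry_statistics(values, labels):
    """Return (R-square, F) for one variable with no other variables entered.
    Assumes the variable is not constant across all observations."""
    grand_mean = sum(values) / len(values)
    by_class = {}
    for label, value in zip(labels, values):
        by_class.setdefault(label, []).append(value)
    ss_between = sum(
        len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
        for obs in by_class.values()
    )
    ss_within = sum(
        (value - sum(obs) / len(obs)) ** 2
        for obs in by_class.values()
        for value in obs
    )
    df_between = len(by_class) - 1
    df_within = len(values) - len(by_class)
    r_square = ss_between / (ss_between + ss_within)
    # No variation within treatments gives an infinite F ("Infty" in SAS).
    if ss_within == 0:
        f_value = math.inf
    else:
        f_value = (ss_between / df_between) / (ss_within / df_within)
    return r_square, f_value


step1 = {name: entry_statistics(values, treatments)
         for name, values in groups.items()}
for name, (r_square, f_value) in step1.items():
    print(f"{name}  R-Square = {r_square:.4f}  F = {f_value:.2f}")
```

Group1 and Group6 have identical counts in every replicate of a treatment, so their within-treatment variation is zero; that is why SAS reports an R-Square of 1 and an infinite F value for them.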

(5) Stepwise Regression analysis: In this case you are looking for fingerprint groups in which the abundance of clones between treatments varies with some measurement of ecosystem function that varies between treatments. An example of an ecosystem function would be the ability of a soil to suppress a plant parasitic nematode. The data used in this analysis are identical to those used in the Stepwise Discriminant analysis, except that this time there is an additional data column for the measurement.

Treat / Rep / Group1 / Group2 / Group3 / Group4 / Group5 / Group6 / Measurement
One / a / 9 / 4 / 0 / 0 / 0 / 0 / 6
One / b / 9 / 3 / 1 / 0 / 0 / 0 / 6
One / c / 9 / 2 / 3 / 1 / 2 / 0 / 6
Two / a / 0 / 4 / 1 / 0 / 1 / 0 / 0
Two / b / 0 / 1 / 3 / 2 / 0 / 0 / 0
Two / c / 0 / 4 / 5 / 1 / 2 / 0 / 0
Three / a / 2 / 2 / 4 / 1 / 0 / 0 / 3
Three / b / 2 / 3 / 3 / 0 / 0 / 0 / 3
Three / c / 2 / 3 / 0 / 0 / 0 / 0 / 3
Four / a / 1 / 3 / 1 / 0 / 0 / 5 / 2
Four / b / 1 / 2 / 0 / 0 / 0 / 5 / 2
Four / c / 1 / 7 / 0 / 0 / 0 / 5 / 2

An example of a stepwise regression SAS program follows:

options ls=80 ps=55 nocenter nodate;

data stepwisemeasure;
/* Measurement represents the property to be measured, such as ability to digest waste or suppress plant disease */
/* The "$" indicates that it is a text variable */
input Treatment $ Replicate $ Group1 Group2 Group3
      Group4 Group5 Group6 measurement;
title1 'Stepwise Regression';
title2 'Example of Measurement Data';
/* datalines statement to indicate data is about to begin */
datalines;
One a 9 4 0 0 0 0 6
One b 9 3 1 0 0 0 6
One c 9 2 3 1 2 0 6
Two a 0 4 1 0 1 0 0
Two b 0 1 3 2 0 0 0
Two c 0 4 5 1 2 0 0
Three a 2 2 4 1 0 0 3
Three b 2 3 3 0 0 0 3
Three c 2 3 0 0 0 0 3
Four a 1 3 1 0 0 5 2
Four b 1 2 0 0 0 5 2
Four c 1 7 0 0 0 5 2
;
run;

proc reg data=stepwisemeasure;
model measurement = Group1 Group2 Group3 Group4 Group5 Group6 / selection=stepwise;
title1 'Stepwise Regression';
/* Add & subtract one at a time & compare f0 */
run;

When the above program is run, the output looks like this:

Stepwise Regression

The REG Procedure
Model: MODEL1
Dependent Variable: measurement

Stepwise Selection: Step 1

Variable Group1 Entered: R-Square = 0.8971 and C(p) = 3.3051

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1          50.46000       50.46000      87.15    <.0001
Error              10           5.79000        0.57900
Corrected Total    11          56.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               1.01000           0.28808       7.11698      12.29    0.0057
Group1                  0.58000           0.06213      50.46000      87.15    <.0001

Bounds on condition number: 1, 1

------

Stepwise Selection: Step 2

Variable Group5 Entered: R-Square = 0.9286 and C(p) = 1.8366

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2          52.23639       26.11819      58.57    <.0001
Error               9           4.01361        0.44596
Corrected Total    11          56.25000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               1.19154           0.26869       8.77017      19.67    0.0016
Group1                  0.59018           0.05476      51.79362     116.14    <.0001
Group5                 -0.50899           0.25503       1.77639       3.98    0.0771

Bounds on condition number: 1.0088, 4.035

------

All variables left in the model are significant at the 0.1500 level.

No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection

        Variable    Variable    Number     Partial     Model
Step    Entered     Removed     Vars In    R-Square    R-Square    C(p)      F Value    Pr > F
1       Group1                  1          0.8971      0.8971      3.3051      87.15    <.0001
2       Group5                  2          0.0316      0.9286      1.8366       3.98    0.0771

The program automatically displays the groups in which there is some correlation between the measurement and the number of clones in each treatment. Pick out significant groups based on the p-value. In this case, Group1 shows a significant (p<0.05) correlation between the measurement and the number of clones in each treatment. Group5 also entered the model, because the entry level used in this run is 0.15, but it is not significant at the 0.05 level (p=0.0771).
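Step 1 of the stepwise selection amounts to fitting a simple linear regression of the measurement on each group and entering the group that explains the most variation. The Python below is a minimal sketch of that first step for the example data; the function name is hypothetical, and the later steps (adding and removing further variables) are left to proc reg.

```python
measurement = [6, 6, 6, 0, 0, 0, 3, 3, 3, 2, 2, 2]
groups = {
    "Group1": [9, 9, 9, 0, 0, 0, 2, 2, 2, 1, 1, 1],
    "Group2": [4, 3, 2, 4, 1, 4, 2, 3, 3, 3, 2, 7],
    "Group3": [0, 1, 3, 1, 3, 5, 4, 3, 0, 1, 0, 0],
    "Group4": [0, 0, 1, 0, 2, 1, 1, 0, 0, 0, 0, 0],
    "Group5": [0, 0, 2, 1, 0, 2, 0, 0, 0, 0, 0, 0],
    "Group6": [0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 5],
}


def simple_regression(x, y):
    """Return (intercept, slope, R-square, F) for y regressed on x alone."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = sxy ** 2 / (sxx * syy)
    # F test for a single predictor: 1 and n - 2 degrees of freedom.
    f_value = r_square / (1 - r_square) * (n - 2)
    return intercept, slope, r_square, f_value


fits = {name: simple_regression(x, measurement) for name, x in groups.items()}
best = max(fits, key=lambda name: fits[name][2])
intercept, slope, r_square, f_value = fits[best]
print(f"First variable entered: {best}")
print(f"measurement = {intercept:.2f} + {slope:.2f} * {best}")
print(f"R-Square = {r_square:.4f}, F = {f_value:.2f}")
```

The result agrees with Step 1 of the SAS output: Group1 enters first, with an intercept of 1.01, a slope of 0.58, an R-Square of 0.8971, and an F value of 87.15.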