D:\proj\docs\Epi_out2.doc 18 July 2003
Description and Interpretation of Statistical/Epidemiological Output
Epi Info Version 6.04d
Kevin M. Sullivan, PhD, MPH, MHA
This document describes and provides interpretation of the output of commands within the ANALYSIS program of Epi Info. The analytic commands in ANALYSIS, which also form the outline of this document, are shown below:
FREQ
DESCRIBE
TABLES
2 x 2 Tables
2 x 2 x S Tables
R x 2 and Rx2xS Tables
R x C Tables
MATCH
MEANS
REGRESS
Simple Linear Regression
Multiple Linear Regression
FREQ Command
The FREQ command is used for numeric, character, or date variables to determine the frequency of values. The output is provided in three sections: first, a table showing the frequency of values for a variable; second, descriptive statistics (shown only for numeric data); and third, a Student’s t-test (shown only for numeric data). The example below uses the OSWEGO file and the command FREQ AGE. In this dataset, AGE represents age in years. There are many more ages in the actual dataset; only the three youngest and three oldest values are shown.
AGE    | Freq  Percent   Cum.
-------+----------------------
3      |    1     1.3%    1.3%
7      |    2     2.7%    4.0%
8      |    2     2.7%    6.7%
...
72     |    1     1.3%   97.3%
74     |    1     1.3%   98.7%
77     |    1     1.3%  100.0%
-------+----------------------
Total  |   75   100.0%
AGE: Shows values for the variable AGE in ascending order. In this example, the first value is a three-year-old.
Freq: The frequency of the observations. In the example, there is only one three-year-old in the datafile.
Percent: The percent of individuals at each level of AGE. In the example, there is only one three-year-old out of 75 records in the table, 1/75 = 1.3%.
Cum.: Cumulative percent; the running total of the Percent column.
Total: Total number of observations in the table (75) and the total percent, which is always 100.0%.
If you do not want to see the table, which can sometimes be extremely long, add the option /N to the command; for example, FREQ AGE /N
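The Percent and Cum. columns follow directly from the frequencies. As a minimal sketch (in Python, which is not part of Epi Info), the lines below recompute the three youngest rows of the FREQ AGE table from their frequencies and the table total of 75. The function name `freq_rows` is purely illustrative.

```python
def freq_rows(counts, total):
    """Yield (value, freq, percent, cumulative percent) for each row."""
    running = 0
    for value, freq in counts:
        running += freq
        yield value, freq, 100 * freq / total, 100 * running / total

# The three youngest ages in the FREQ AGE table; 75 is the table total.
youngest = [(3, 1), (7, 2), (8, 2)]
rows = list(freq_rows(youngest, total=75))

for value, freq, pct, cum in rows:
    print(f"{value:>3} | {freq:>4} {pct:6.1f}% {cum:6.1f}%")
```

Because the cumulative percent is computed from the running count rather than by adding rounded percents, it does not accumulate rounding error.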
Total Sum Mean Variance Std Dev Std Err
75 2761 36.813 460.181 21.452 2.477
Minimum 25%ile Median 75%ile Maximum Mode
3.000 16.000 36.000 58.000 77.000 11.000
Student's "t", testing whether mean differs from zero.
T statistic = 14.862, df = 74 pvalue = 0.00000
Total: The total number of observations included in the table. In the example, 75 observations.
Sum: The sum of all values, in this case, the sum of all ages. Usually not a very meaningful number but can be used to calculate the mean.
Mean: The mean or average value of the observations. In the example the mean age of the observations is 36.8 years and can be calculated as the Sum/Total.
Measures of Variability: Three different measures of variability around the mean are provided, the Variance, standard deviation (Std Dev), and the standard error (Std Err). To calculate a 95% two-sided confidence interval around the mean, use the formula:
Mean ± (t-value) * (Std Err)
If you assume a large sample size, the t-value would be 1.96, and in the above example the 95% confidence interval would be calculated as 36.813 ± 1.96 (2.477) = (31.958, 41.668). Note that the value of 1.96 should be used only with large sample sizes, around 120 or more observations. For a smaller number of observations, the t-value for the appropriate degrees of freedom should be substituted. In the above example with n-1 degrees of freedom (df = 74), the t-value for a 95% two-sided confidence interval would be 1.99. Therefore, the correct confidence interval would be 36.813 ± 1.99 (2.477) = (31.884, 41.742), which in this example makes little difference. The t-value can be obtained from a table in a statistics book or from the PEPI (Computer Programs for Epidemiologic Analysis) software. In PEPI, the program is PVALUE; use option S.
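The interval arithmetic can be checked with a few lines of Python. This is only a sketch: the mean and standard error are copied from the FREQ AGE output, the two t-values are the ones quoted in the text, and `mean_ci` is an illustrative helper name.

```python
def mean_ci(mean, std_err, t_value):
    """Two-sided confidence interval: mean plus or minus t_value * std_err."""
    half_width = t_value * std_err
    return mean - half_width, mean + half_width

# Mean and Std Err from the FREQ AGE output; t-values as quoted in the text.
large_sample = mean_ci(36.813, 2.477, 1.96)  # large-sample approximation
with_df_74 = mean_ci(36.813, 2.477, 1.99)    # t-value for df = 74

print([round(x, 3) for x in large_sample])
print([round(x, 3) for x in with_df_74])
```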
Minimum and Maximum values: The smallest (minimum) and largest (maximum) values in the table.
Percentile values: The twenty-fifth (25%ile), fiftieth (Median), and seventy-fifth (75%ile) percentile values are provided. A substantial difference between the mean and the median suggests that the data are skewed.
Mode: The most frequent observation in the table. Note that if there is more than one mode in the data, Epi Info presents only the smallest modal value.
Student’s “t”: The statistics are for comparing the mean of the observations with a null value of zero. This statistic is occasionally useful, but in most instances it is not. In the above example, determining whether the mean age of those in the table differs from zero does not make much sense. However, it would make sense in the following example. Using an example from Rosner (Table 8.1), systolic blood pressure measurements are obtained from 10 women; one measurement from each woman is taken at baseline, the women are given oral contraceptives, and then another measurement is obtained. If oral contraceptives do not affect blood pressure, then we would expect the difference between the two measurements on each woman to be, on average, zero. If the oral contraceptives do affect blood pressure, then the second measurement should differ from the first. The data are:
Woman No.   Measure 1   Measure 2   Difference
    1          115         128          13
    2          112         115           3
    3          107         106          -1
    4          119         128           9
    5          115         122           7
    6          138         145           7
    7          126         132           6
    8          105         109           4
    9          104         102          -2
   10          115         117           2
If a frequency of the difference is performed, the Student’s t-test section is a paired t-test. The results of the frequency as performed in ANALYSIS are:
DIFF   | Freq  Percent   Cum.
-------+----------------------
-2     |    1    10.0%   10.0%
-1     |    1    10.0%   20.0%
2      |    1    10.0%   30.0%
3      |    1    10.0%   40.0%
4      |    1    10.0%   50.0%
6      |    1    10.0%   60.0%
7      |    2    20.0%   80.0%
9      |    1    10.0%   90.0%
13     |    1    10.0%  100.0%
-------+----------------------
Total  |   10   100.0%
The variable name for the difference in this example is DIFF, which is the same as “Difference” in the table above. The differences in systolic blood pressure (measurement 2 minus measurement 1) range from -2 to +13.
Total Sum Mean Variance Std Dev Std Err
10 48 4.800 20.844 4.566 1.444
Minimum 25%ile Median 75%ile Maximum Mode
-2.000 2.000 5.000 7.000 13.000 7.000
Student's "t", testing whether mean differs from zero.
T statistic = 3.325, df = 9 pvalue = 0.00887
The Student’s t-test statistic makes sense in this example and shows that, on average, a woman’s systolic blood pressure increases by 4.8 mm Hg after the use of oral contraceptives, and that this increase is significantly different from zero (p-value of 0.00887). A 95% two-sided confidence interval around the mean difference can be calculated by:
Mean ± (t-value) * (Std Err)
In this example, the t-value for a 95% confidence interval would need to be obtained from a textbook or from a program, such as PEPI, that can provide the value (Epi Info does not provide a program for this purpose). For a large number of observations, the value of 1.96 can be used. In the above example, look up the t-value for n-1 degrees of freedom (10 - 1 = 9) for a two-sided 95% confidence interval, which is 2.262. Applying this to the results above gives:
4.8 ± 2.262(1.444) = (1.53, 8.07)
The interpretation is that we are 95% confident that the true increase in systolic blood pressure is captured between the values of 1.5 mm Hg and 8.1 mm Hg.
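The paired analysis can be reproduced outside Epi Info from the ten pairs of measurements listed earlier. The sketch below uses only the Python standard library. The critical value 2.262 is taken from the text rather than computed, and taking the minimum of the modes mirrors the statement that Epi Info reports the smallest modal value.

```python
from math import sqrt
from statistics import mean, median, multimode, variance

# Systolic blood pressure before (measure_1) and after (measure_2), women 1-10.
measure_1 = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
measure_2 = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]
diff = [after - before for before, after in zip(measure_1, measure_2)]

n = len(diff)
avg = mean(diff)
var = variance(diff)                  # sample variance (n - 1 denominator)
std_err = sqrt(var / n)
t_stat = avg / std_err                # tests whether the mean differs from zero
smallest_mode = min(multimode(diff))  # smallest modal value, as the text describes

t_crit = 2.262                        # from the text: df = 9, two-sided 95%
ci = (avg - t_crit * std_err, avg + t_crit * std_err)

print(f"n={n} sum={sum(diff)} mean={avg:.3f} var={var:.3f} se={std_err:.3f}")
print(f"median={median(diff)} mode={smallest_mode} t={t_stat:.3f} df={n - 1}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

A one-sample t-test of the differences against zero is the same calculation as a paired t-test of the two measurements, which is why the FREQ output can be read this way.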
When the FREQ command is used on variables that contain character information, only the table will be presented. For example, using the OSWEGO dataset, doing a frequency of SEX will provide the following table:
SEX    | Freq  Percent   Cum.
-------+----------------------
F      |   44    58.7%   58.7%
M      |   31    41.3%  100.0%
-------+----------------------
Total  |   75   100.0%
Note that descriptive statistics are not provided, since they would not make sense with character information. In this example, 44/75, or 58.7%, of the individuals in this table are female. The values for SEX (F and M) are listed in ascending alphabetical order. To get confidence intervals for the percents, use the command FREQ SEX /C, which results in the following output:
SEX    | Freq  Percent   Cum.    95% Conf Limit
-------+----------------------------------------
F      |   44    58.7%   58.7%   46.7%  69.9%
M      |   31    41.3%  100.0%   30.1%  53.3%
-------+----------------------------------------
Total  |   75   100.0%
For small sample sizes, the confidence intervals are calculated using the exact binomial method, and for larger sample sizes, the Fleiss quadratic method is used.
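To give a feel for what an exact binomial interval involves, the sketch below finds the limits by bisection on the binomial distribution. It assumes the usual Clopper-Pearson definition of "exact", which is an assumption about what Epi Info computes and not something taken from its code. The sample size at which Epi Info switches to the Fleiss quadratic method is not stated here. The function names are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def exact_binomial_ci(x, n, conf=0.95, tol=1e-10):
    """Exact (Clopper-Pearson) limits for a proportion x/n, found by bisection."""
    alpha = 1 - conf

    def solve(k, target):
        # binom_cdf(k, n, p) falls as p rises, so bisect on p until it meets target.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(x - 1, 1 - alpha / 2)
    upper = 1.0 if x == n else solve(x, alpha / 2)
    return lower, upper

# Counts from the FREQ SEX table: 44 females and 31 males out of 75.
for label, count in [("F", 44), ("M", 31)]:
    lower, upper = exact_binomial_ci(count, 75)
    print(f"{label}: {100 * count / 75:.1f}% ({100 * lower:.1f}% to {100 * upper:.1f}%)")
```

The limits for M mirror those for F (each is 100% minus the other), which matches the pattern in the FREQ SEX /C output.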
DESCRIBE Command
The DESCRIBE command provides descriptive information for numeric variables. This command provides some of the same information that can be obtained with the FREQ command. Using the OSWEGO data, the output is:
Variable Obs Total Mean Variance Std Dev
AGE 75 2761.000 36.813 460.181 21.452
TIMESUPPER 28 54060.000 1930.714 194162.434 440.639
ONSETTIME 46 52075.000 1132.065 1.119E+06 1057.898
TABLES Command
The TABLES command is for creating tables of categorical data. A single table of 2 or more rows and 2 or more columns can be provided, which will be described as R x C tables (Row by Column). In addition to the row and column variables, you can stratify on additional variables, which will be described as R x C x S tables (Row by Column by Strata).
2 x 2 Tables
In epidemiology, 2 x 2 tables are used frequently. In these types of tables, there is usually an “exposure” variable which can take on two levels (e.g., exposed vs. not exposed) and an outcome variable (e.g., the person had the disease or outcome of interest or they did not). With 2 x 2 tables, the odds ratio (OR) and risk ratio (RR) are calculated. In order to get the correct estimates for the OR and RR, the table must be set up as shown below, with the "disease" or "outcome" variable as the column variable and the exposure variable as the row variable. Another important aspect is that, for the "disease" or “outcome” variable, those with the outcome of interest should be in the first column (in this example "Yes"), and those with the exposure of interest should be in the first row (in this example also "Yes").
Disease
Exposed │ Yes No │ Total
───────────┼───────────────┼──────
Yes │ a b │ a+b
No │ c d │ c+d
───────────┼───────────────┼──────
Total │ a+c b+d │a+b+c+d
The OR and RR are calculated as:
OR = (a*d)/(b*c)
RR = [a/(a+b)]/[c/(c+d)]
The calculations are carried out in this manner regardless of whether the table is set up correctly or incorrectly. The computer program relies on the user to ensure that the information is provided correctly. What happens if the table is set up incorrectly? There are eight possible ways to arrange the table (see Table 1), only one of which is correct. Only two possible odds ratios can be calculated, the true OR (1.75) and its inverse (0.57 or 1/1.75). There are eight different possible “Risk Ratios” that can be calculated, of which only one (1.24) is correct.
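The effect of a wrongly arranged table is easy to see numerically. The sketch below applies the two formulas to the correct arrangement of the example data (a = 27, b = 13, c = 19, d = 16) and to the same data with the disease columns switched. The function names are illustrative and not part of Epi Info.

```python
def odds_ratio(a, b, c, d):
    """OR = (a*d)/(b*c) for a table with cells a, b (row 1) and c, d (row 2)."""
    return (a * d) / (b * c)

def risk_ratio(a, b, c, d):
    """RR = [a/(a+b)] / [c/(c+d)]."""
    return (a / (a + b)) / (c / (c + d))

# Correct arrangement: exposed in the first row, diseased in the first column.
correct = (27, 13, 19, 16)
# Same data with the two disease columns switched.
switched = (13, 27, 16, 19)

for label, cells in [("correct", correct), ("columns switched", switched)]:
    print(f"{label}: OR = {odds_ratio(*cells):.2f}, RR = {risk_ratio(*cells):.2f}")
```

Switching the columns turns the OR into its reciprocal, but the "RR" becomes a different quantity altogether (a ratio of the proportions not ill), which is why only one arrangement gives a usable risk ratio.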
The general form of the TABLES command is:
TABLES <row variable> <column variable> {<one or more stratifying variables>}
or
TABLES <exposure variable> <disease variable> {<one or more stratifying variables>}
Using the OSWEGO dataset, an example of a 2 x 2 table is provided using the command TABLES CAKES ILL. Note that the exposure variable CAKES is given first for the rows, and the second variable ILL for the columns. Also note that in the datafile the response to both of these variables was actually entered as “Y” or “N”. In ANALYSIS, all Y/N responses are switched to +/- so that the table is set up correctly. If Y/N’s were used, ANALYSIS would put the responses in alphabetical order, first N then Y, which would make the table setup incorrect for calculating the risk ratio. If you do not want ANALYSIS to switch Y/N responses to +/- responses, at the EPI prompt type:
SET SWITCH = OFF
Table 1. Possible ways to arrange a 2x2 table
Disease Status in Columns, Exposure Status in Rows
1. Correct                              2. Disease Columns Switched
              Disease                                 Disease
Exposed     Yes    No   Total           Exposed      No   Yes   Total
Yes          27    13      40           Yes          13    27      40
No           19    16      35           No           16    19      35
Total        46    29      75           Total        29    46      75
OR = 1.75, RR = 1.24                    “OR” = 0.57, “RR” = 0.71
3. Exposure Rows Switched               4. Both Disease & Exposure Switched
              Disease                                 Disease
Exposed     Yes    No   Total           Exposed      No   Yes   Total
No           19    16      35           No           16    19      35
Yes          27    13      40           Yes          13    27      40
Total        46    29      75           Total        29    46      75
“OR” = 0.57, “RR” = 0.80                “OR” = 1.75, “RR” = 1.41
Disease Status in Rows, Exposure Status in Columns (which is incorrect in Epi Info)
5. Yes in First Column/Row              6. Exposure Columns Switched
              Exposed                                 Exposed
Disease     Yes    No   Total           Disease      No   Yes   Total
Yes          27    19      46           Yes          19    27      46
No           13    16      29           No           16    13      29
Total        40    35      75           Total        35    40      75
“OR” = 1.75, “RR” = 1.31                “OR” = 0.57, “RR” = 0.75
7. Disease Rows Switched                8. Both Disease & Exposure Switched
              Exposed                                 Exposed
Disease     Yes    No   Total           Disease      No   Yes   Total
No           13    16      29           No           16    13      29
Yes          27    19      46           Yes          19    27      46
Total        40    35      75           Total        35    40      75
“OR” = 0.57, “RR” = 0.76                “OR” = 1.75, “RR” = 1.34
In the example below, the row and column percentages are also presented. To have percentages shown in the table, at the EPI prompt enter:
SET PERCENTS = ON
The output for the TABLES command is in three sections: 1) The table; 2) parameter estimation with confidence intervals and some probability; and 3) statistical tests.
             ILL
CAKES  |     +      -  | Total
-------+---------------+------
     + |    27     13  |    40
       > 67.5%  32.5%  > 53.3%
       | 58.7%  44.8%  |
     - |    19     16  |    35
       > 54.3%  45.7%  > 46.7%
       | 41.3%  55.2%  |
-------+---------------+------
 Total |    46     29  |    75
       | 61.3%  38.7%  |
In the table, for each cell, there is the number of observations and, when SET PERCENTS = ON, the row and column percentages. For example, in the upper left cell, there were 27 observations, the row percent is 67.5% (27/40), and the column percent is 58.7% (27/46). The margin totals and percentages are also provided. When the row variable is an “exposure” and the column variable an “outcome,” the proportions have further meaning epidemiologically. In this foodborne outbreak, the “attack rate” among the exposed (those who ate the food) was 67.5%, compared to the “attack rate” among those who were not exposed, 54.3%. If the outcome was prevalence, you would have the prevalence of disease among the exposed and the prevalence among the nonexposed.
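The row and column percentages can be recomputed from the four cell counts. In the sketch below the dictionary layout, keyed by (CAKES, ILL), is just one convenient representation chosen for illustration.

```python
# Cell counts keyed by (CAKES, ILL), taken from the TABLES CAKES ILL output.
cells = {("+", "+"): 27, ("+", "-"): 13, ("-", "+"): 19, ("-", "-"): 16}
levels = ("+", "-")

row_total = {r: sum(cells[r, c] for c in levels) for r in levels}
col_total = {c: sum(cells[r, c] for r in levels) for c in levels}

row_pct = {(r, c): 100 * cells[r, c] / row_total[r] for r in levels for c in levels}
col_pct = {(r, c): 100 * cells[r, c] / col_total[c] for r in levels for c in levels}

# The attack rate is the percent ill within each exposure row.
print(f"attack rate, ate cake:      {row_pct['+', '+']:.1f}%")
print(f"attack rate, did not eat:   {row_pct['-', '+']:.1f}%")
print(f"column percent, upper left: {col_pct['+', '+']:.1f}%")
```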
Single Table Analysis
Odds ratio 1.75
Cornfield 95% confidence limits for OR 0.61 < OR < 5.03
Maximum likelihood estimate of OR (MLE) 1.74
Exact 95% confidence limits for MLE 0.62 < OR < 4.97
Exact 95% MidP limits for MLE 0.67 < OR < 4.53
Probability of MLE >= 1.74 if population OR = 1.0 0.17499502
RISK RATIO(RR)(Outcome:ILL=+; Exposure:CAKES=+) 1.24
95% confidence limits for RR 0.86 < RR < 1.80
Ignore risk ratio if case control study
A “Single Table Analysis” is provided for a single 2x2 table. The parameters estimated are the odds ratio (OR) and the risk ratio (RR), which are calculated as described previously. In this example the OR is 1.75 and the RR is 1.24. In this table, the RR is the proportion of cake eaters who became ill divided by the proportion of non-cake eaters who became ill, or from the table, 67.5/54.3 = 1.24. The OR in this example is an “unmatched” OR, i.e., the assumption in the 2x2 table is that the data are not from a matched case-control study. Epi Info can deal with matched case-control data, which is described later under the MATCH command. For the OR there is an additional point estimate based on the conditional maximum likelihood estimate (MLE; for more information, see Rothman, Modern Epidemiology, 1986, page 194). Also provided are confidence intervals. For the odds ratio, three different types of confidence intervals are provided: Cornfield’s, the Exact, and the Exact Mid-P. Which one should you use? While there is some debate over the merits of the different interval methods, of the three, the one I would choose is the Exact Mid-P. When the sample size is large, all three methods will provide very similar lower and upper confidence bounds. For the OR, a statistical test similar to the Fisher exact test is also provided: the probability of observing an MLE at least as large as the one obtained if the population OR = 1.0, which in this example is a p-value of 0.17499502. For the RR, a 95% confidence interval using the Robins-Greenland-Breslow (RGB) method is calculated. Note that the output provides 95% confidence intervals; to change the interval, for example, to 90%, use the command SET CONFIDENCE = 90. There are some problems with the SET CONFIDENCE command in Version 6.04: first, it only affects the Cornfield’s and RR intervals, and second, the screen will still say “95% confidence limits.”
A message is provided telling the user to “Ignore risk ratio if case control study.” A risk ratio cannot be directly calculated from a case-control study.
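As a rough check on the risk ratio limits, the sketch below uses a common log-scale (Taylor series) approximation for the standard error of ln(RR). Whether this is exactly the calculation Epi Info performs is an assumption. The Cornfield, exact, and mid-P limits for the OR need iterative methods and are not attempted here. The function name and the `z` parameter are illustrative.

```python
from math import exp, log, sqrt

def risk_ratio_ci(a, b, c, d, z=1.96):
    """Risk ratio with a log-scale (Taylor series) confidence interval."""
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr)

# Cell counts from the CAKES by ILL table.
rr, rr_lower, rr_upper = risk_ratio_ci(27, 13, 19, 16)
print(f"RR = {rr:.2f}, 95% limits {rr_lower:.2f} to {rr_upper:.2f}")
```

For a 90% interval, pass `z=1.645` in place of the default 1.96.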
                     Chi-Squares   P-values
Uncorrected:            1.37      0.24105316
Mantel-Haenszel:        1.36      0.24421475
Yates corrected:        0.87      0.34993352
Three chi-square tests are provided, and in many instances they will have similar p-values. They differ mainly when the data are sparse. My own preference is the Mantel-Haenszel test. Before performing any analyses, you should decide which chi-square test you will use. Tables with an expected cell size < 5 also have Fisher exact 1-tailed and 2-tailed p-values. My preference is to use the Fisher exact 2-tailed.
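The three chi-square statistics differ only in small adjustments to the same quantity, as the sketch below shows using the standard textbook formulas for a 2x2 table. The formulas are assumed here and not taken from Epi Info's source, and the function names are illustrative. The p-value for one degree of freedom is obtained from the complementary error function in the Python standard library.

```python
from math import erfc, sqrt

def chi_squares(a, b, c, d):
    """Uncorrected, Mantel-Haenszel, and Yates-corrected chi-squares for a 2x2 table."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    delta = abs(a * d - b * c)
    return {
        "Uncorrected": n * delta**2 / margins,
        "Mantel-Haenszel": (n - 1) * delta**2 / margins,
        "Yates corrected": n * max(delta - n / 2, 0) ** 2 / margins,
    }

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))

# Cell counts from the CAKES by ILL table.
results = chi_squares(27, 13, 19, 16)
for name, chi2 in results.items():
    print(f"{name:<16} {chi2:5.2f}  {p_value_1df(chi2):.8f}")
```

The Mantel-Haenszel statistic is the uncorrected one multiplied by (n - 1)/n, so the two are nearly identical unless the table is small.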
2 x 2 x S Tables
Another frequent type of analysis in epidemiology is “stratification.” In this case there is an “exposure” variable which can take on two levels (e.g., exposed vs. not exposed), an outcome variable (e.g., the person had the disease or outcome of interest or they did not), and a stratifying variable which may have two or more levels. Stratification is performed to investigate whether a stratifying variable is an effect modifier of the exposure-disease relation, a confounder of the relation, or neither. The general output of stratification is to provide a table, parameter estimation, and statistics for each level of the stratifying variable; after all the stratum-specific results are provided, the summary or pooled results are provided along with tests for interaction.
An example is provided below using the OSWEGO data file with the command TABLES CAKES ILL SEX:
** Beginning Stratified Analysis **
SEX =F
             ILL
CAKES  |     +      -  | Total
-------+---------------+------
     + |    20      5  |    25
       > 80.0%  20.0%  > 56.8%
       | 66.7%  35.7%  |
     - |    10      9  |    19
       > 52.6%  47.4%  > 43.2%
       | 33.3%  64.3%  |
-------+---------------+------
 Total |    30     14  |    44
       | 68.2%  31.8%  |
The first table is shown above. At the very top of the table, the message “** Beginning Stratified Analysis **” lets you know that a stratified analysis has been requested. The table shows the relation between eating cake and becoming ill among females only (SEX =F).
Below is the Single Table Analysis, which provides the odds ratio, risk ratio, and statistical tests on the relation between eating cakes and becoming ill among females only.
Single Table Analysis
Stratum 1
Odds ratio 3.60
Cornfield 95% confidence limits for OR 0.79 < OR < 17.26*
*May be inaccurate
Maximum likelihood estimate of OR (MLE) 3.49
Exact 95% confidence limits for MLE 0.80 < OR < 17.20
Exact 95% MidP limits for MLE 0.92 < OR < 14.40
Probability of MLE >= 3.49 if population OR = 1.0 0.05451025
RISK RATIO(RR)(Outcome:ILL=+; Exposure:CAKES=+) 1.52
95% confidence limits for RR 0.95 < RR < 2.43
Ignore risk ratio if case control study
                     Chi-Squares   P-values
Uncorrected:            3.73      0.05352927
Mantel-Haenszel:        3.64      0.05631870
Yates corrected:        2.57      0.10873499
SEX =M
             ILL
CAKES  |     +      -  | Total
-------+---------------+------
     + |     7      8  |    15
       > 46.7%  53.3%  > 48.4%
       | 43.8%  53.3%  |
     - |     9      7  |    16
       > 56.3%  43.8%  > 51.6%
       | 56.3%  46.7%  |
-------+---------------+------
 Total |    16     15  |    31
       | 51.6%  48.4%  |
The table above is the second stratum, in which the relation between eating cakes and getting ill is shown for males.
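With both stratum tables now shown, the stratum-specific estimates can be compared with the crude (unstratified) ones. The sketch below does this, using the same OR and RR formulas given for 2 x 2 tables. The dictionary of strata and the function names are illustrative only, and no pooled or summary estimate is attempted.

```python
def odds_ratio(a, b, c, d):
    """OR = (a*d)/(b*c)."""
    return (a * d) / (b * c)

def risk_ratio(a, b, c, d):
    """RR = [a/(a+b)] / [c/(c+d)]."""
    return (a / (a + b)) / (c / (c + d))

# Cells (a, b, c, d) of the CAKES by ILL table within each level of SEX.
strata = {"F": (20, 5, 10, 9), "M": (7, 8, 9, 7)}

# Adding the strata cell by cell recovers the crude table.
crude = tuple(sum(cells) for cells in zip(*strata.values()))

print(f"crude {crude}: OR = {odds_ratio(*crude):.2f}, RR = {risk_ratio(*crude):.2f}")
for sex, cells in strata.items():
    print(f"SEX = {sex} {cells}: OR = {odds_ratio(*cells):.2f}, RR = {risk_ratio(*cells):.2f}")
```

Stratum-specific estimates that fall on opposite sides of the crude estimate, as they do here, are the kind of pattern that the later tests for interaction are meant to assess.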
Below is the analysis for the males.
Single Table Analysis
Stratum 2
Odds ratio 0.68
Cornfield 95% confidence limits for OR 0.13 < OR < 3.56