Introduction to Epi-Info

An introduction to Epi-Info

Gavin Shaddick

Minsk, February 2000


Table of Contents

A brief introduction

Starting Epi-Info

Introduction to the Analysis program

Browsing the data

First look at the data

One-way frequency tables

Entering Commands and Variable names using Menus

Displaying Categorical Variables

Statistics on Continuous Variables

Displaying Continuous Variables

Saving variables and leaving Epi-Info

Analysis of Quantitative variables

Introduction

Using the Means Command

Linear Regression

Analysis of Categorical Data

Introduction

Using the Tables command

Using the Statcalc program for stratified analysis

Introduction

Linear trend in proportions

Reading a data file into Epi-info

Creating a questionnaire (qes) File

Variable types

Creating a .rec file

Importing the data

A brief introduction

Epi-info is a multi-purpose computer package designed for use by epidemiological researchers. It contains smaller programs for use with Survey Design (Epiaid), Questionnaire Design and Report Writing (Eped), Data Entry (Enter), Data Checking (Check), Data Analysis (Analysis), Simple Statistics (Statcalc), Importing and Exporting files (Import, Export). There is also a separate package for mapping (EpiMap).

The package is made available by WHO and CDC as public domain software and can be downloaded (free of charge) from .

Starting Epi-Info

Epi-info is a DOS-based program driven by pull-down menus; a mouse can also be used. The up and down arrow keys move the cursor through the menus so that you can see the description of each program. Note that on a colour display an alternative way of moving up and down is to press the highlighted letter of the program you require.

Here we concentrate on the Analysis program.

Introduction to the Analysis program

Position the cursor bar on Analysis and read the description on the right hand side. Press ENTER to select Analysis. The screen goes blank for a few seconds and then the Analysis screen appears. The screen is split into two: the upper window is headed Output and the lower, smaller, window is headed Commands. The cursor is in the Commands window at the EPI> prompt. At the top of the screen are two lines giving the status information:

Dataset: <None>          Free memory: 262K

Use READ to choose a dataset

This indicates that we have not yet specified the name of the dataset to be analysed, and hints how to do it. It also states the amount of free memory.

To load a dataset we use the read command. For example, if the file is called itpexamp we type

read itpexamp

The full name of an Epi-info data file ends with .rec, so the actual name of the file is itpexamp.rec, but Epi-info allows the extension to be omitted.

Note that you should enter the whole path as well as the file name, for example a:\itpexamp.rec

The name of the file and the number of records appear at the top of the screen, indicating that the file has been found and read. We also see that all records have been selected (as so far we have not specified any criteria for selecting or rejecting records).


Browsing the data

You can browse the data by pressing F4. As you pass through the different columns, you can see what type of variables they contain at the top of the screen.

If you press F4 again while browsing, Full screen mode is selected. This shows a single record in its entirety.

Pressing F5 will start Split mode. This is a combination of both modes: browse in the top window and Full screen in the bottom.

Note that although we entered browse by pressing F4 in the analysis mode, we could have also typed browse at the prompt.

First look at the data

When starting to look at any new set of data, one of the first steps is to check that the values of the variables are sensible and that they correspond to the codes defined in the coding schedule or other documentation about the data. For the categorical variables, we might do one-way tables to check that only the specified codes occur and to check for missing values, for example in the sex field there should only be the values 1 and 2. For continuous variables, we need to obtain summary statistics (mean, standard deviation, minimum, maximum) and to check that these are what we expect.

One-way frequency tables

We start by producing one-way tables for the categorical variables. At the prompt type

tables sex

The resulting table appears in the Output window


This shows that there are 45 males (1’s) and 35 females (2’s) together with percentages, the total number of records (80) and summary statistics (sum, mean and standard deviation). Ignore for now the Student’s t-distribution.
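For readers who want to see how the figures in such a table are arrived at, the following is a minimal sketch in Python (not Epi-info). It rebuilds the sex column from the counts quoted above and recomputes the frequencies, percentages and summary statistics. The use of the sample standard deviation is an assumption about which form Epi-info reports.

```python
from collections import Counter
from statistics import mean, stdev

# Rebuild the sex column from the counts reported above: 45 ones, 35 twos.
sex = [1] * 45 + [2] * 35

counts = Counter(sex)
total = len(sex)

# One row per code: (frequency, percentage of all records).
table = {code: (n, 100 * n / total) for code, n in sorted(counts.items())}
for code, (n, pct) in table.items():
    print(f"{code:>3} {n:>5} {pct:6.1f}%")
print(f"Total {total}")

# Summary statistics of the codes themselves (sample standard deviation assumed).
print(f"Sum {sum(sex)}  Mean {mean(sex):.4f}  Std dev {stdev(sex):.4f}")
```

Note that the mean of a coded variable such as sex has no direct interpretation; it is printed only because the table reports it.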

Exercise:

Repeat the tables command for the observed ages (observeage). (Age often takes a large number of possible values, which makes a frequency table large and unwieldy, so other commands would normally be used. Here the range of ages is small and the table can be quite useful.)

What is the youngest age ?

What is the average (mean, mode and median ) age ?

How many 13 year olds are there ?

How many children are 13 years or younger ?

What percentage of the children are 13 years or younger ?
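Questions of the "how many are 13 or younger" kind are answered by adding up the frequencies from the top of the table down to the row of interest. A small sketch of that running total in Python, using made-up ages (hypothetical values, not the itpexamp data):

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ages, for illustration only (not the itpexamp data).
ages = [11, 12, 12, 13, 13, 13, 14, 14, 15, 16]

counts = Counter(ages)
values = sorted(counts)
freqs = [counts[v] for v in values]
cum_freqs = list(accumulate(freqs))  # running total of records
total = len(ages)

# Columns: age, frequency, cumulative frequency, cumulative percentage.
for v, f, c in zip(values, freqs, cum_freqs):
    print(f"{v:>3} {f:>4} {c:>4} {100 * c / total:6.1f}%")

# Records aged 13 or younger: the cumulative count in the row for age 13.
at_most_13 = cum_freqs[values.index(13)]
pct_at_most_13 = 100 * at_most_13 / total
```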

Entering Commands and Variable names using Menus

We have used the tables command by typing it at the command prompt. It is also possible to enter commands by selecting them from a list of commands, similarly it is also possible to select the variable names from a list of variables.

If, for example, we wanted to construct a table of sex, we would press F2, the Commands key, which brings up a list of possible commands. The tables command is in the General section; by highlighting it and pressing ENTER, the command is ‘pasted’ into the command line. Now press the Variables function key, F3, and a list of the variables will be shown. Highlighting sex and pressing ENTER will paste it onto the command line, which can now be entered, giving the same results as when we typed in the commands by hand.

If you want to pick more than one variable in this way, as will be the case when we do two-way tables, you can tag groups of variables using the plus (+) and minus (-) signs, i.e. select sex and press + and then select observeage and press +. You will see that these two variables have been tagged (marked) by a small sign. Press ENTER and both of them will appear in the command line. This also works for more than 2 variables.

Displaying Categorical Variables

The distribution of each of the categorical variables can be displayed using either a bar chart or a pie chart. At the command prompt type (or select)

pie sex

A pie chart should appear on the screen, showing the percentage of males and females. A bar chart can be produced using the command bar

bar sex

Exercise:

Produce pie and bar charts for thyroid medication (THYRMEDICA).

Statistics on Continuous Variables

The command for obtaining summary statistics for continuous variables is means, for example

means height

The output is the same as for the tables command – a frequency table followed by summary statistics. Because there are so many different values for height, the table is much longer. For continuous variables a frequency table is not much use – except for checking for suspicious values. The means command (unlike the tables command) allows us to suppress the frequency table and print only the summary statistics. This is achieved as follows

means height /n

The full specification of the means command includes a grouping variable, but for now we are dealing with all the data together. At this stage we do not want to subdivide the data into groups, for example males and females separately. We need a way of forming one group with all the records in it. This is done as follows:

let groupall = 1

This creates a new variable groupall which has the value 1 for every record. Grouping the data by groupall therefore puts all the records into one group. If you browse the data you can see the new variable. We can now use the means command

means height groupall /n

This produces an entirely different output – no frequency table and a total of 11 statistics.
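To make the meaning of these summary statistics concrete, here is a rough sketch in Python that computes a comparable set for a short list of made-up heights. The particular eleven statistics listed, the sample (n - 1) variance and the quartile method are all assumptions for illustration, not a copy of what Epi-info prints, and the heights are hypothetical.

```python
import statistics as st

# Hypothetical heights in cm, for illustration only.
height = [118, 124, 131, 131, 137, 142, 150, 156]

# Quartiles; the interpolation method is Python's default, which may
# differ slightly from the one Epi-info uses.
q1, median, q3 = st.quantiles(height, n=4)

summary = {
    "obs": len(height),
    "total": sum(height),
    "mean": st.mean(height),
    "variance": st.variance(height),  # sample variance (n - 1 divisor)
    "std dev": st.stdev(height),
    "minimum": min(height),
    "25%ile": q1,
    "median": median,
    "75%ile": q3,
    "maximum": max(height),
    "mode": st.mode(height),
}

for name, value in summary.items():
    print(f"{name:>9}: {value:.2f}")

# Interquartile range and range, as asked for in the exercise below.
iqr = q3 - q1
value_range = max(height) - min(height)
```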

Exercise:

What are the mean and standard deviation of the WEIGHTs of the 80 children ?

What are the median and interquartile range (75th percentile – 25th percentile) of the weights of the 80 children ?

What is the range of the weights ? (minimum – maximum)

Displaying Continuous Variables

Neither bar charts nor pie charts are sensible ways of displaying continuous variables with a lot of different values. Try one of the commands on height and you will see that the result is not very useful.

Bar charts have individual separated bars and are used to display categorical variables for which the order of the categories is irrelevant. Histograms are used for continuous variables.

Usually a continuous variable is grouped before the histogram is drawn. However, if the variable has a relatively small number of distinct values a histogram can sometimes give a good representation of the distribution.

histogram observeage

Exercise:

Are the observed ages of the children approximately Normally distributed ?

Try doing a histogram of the heights of the children

histogram height

If there are a lot of different values, then the resulting histogram can be less useful. We might want to group the variable. To group the height variable we need to create a new variable, which we will call htgp, which will have a grouping interval of 10cm. To form the groups we use the let statement to divide height by 10 and assign the result to htgp. Because we want the new variable to have integer values rather than exact values with decimal places, we use the div operator (this is the way that Epi-info does integer division; the traditional / will give the exact answer).

let htgp = height div 10

Before looking at the histogram, see the effect of the let statement by getting a frequency distribution for htgp

tables htgp

You will see that 8 height groups have been created. Now type

histogram htgp
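The effect of grouping by integer division can also be seen outside Epi-info. The sketch below, in Python, uses the // operator, which for positive values behaves like div as described above, and prints a crude text histogram of the groups. The heights are made up for illustration.

```python
from collections import Counter

# Hypothetical heights in cm, for illustration only.
height = [98.5, 104.0, 112.3, 118.9, 121.0, 127.4, 133.8, 139.9, 140.0]

# Integer division discards the remainder, so 112.3 and 118.9 both
# fall in group 11 (the 110cm band).
htgp = [int(h // 10) for h in height]

freq = Counter(htgp)
for group in sorted(freq):
    low = group * 10
    # One star per record in the band from low up to (not including) low + 10.
    print(f"{low:>3} to under {low + 10} cm  {'*' * freq[group]}")
```

A grouping interval of 5cm would be obtained in the same way by dividing by 5 instead of 10.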


Exercise:

Are the heights Normally distributed ?

Repeat this process for weights. Create a new variable called wtgp again using the let command and the div operator, choosing a sensible grouping interval.

Are the weights approximately Normally distributed ?

Saving variables and leaving Epi-Info

If you have created new variables, you might want to save them for use the next time you use Epi-Info. You could re-write the original data file, but it is recommended that you save to a new file. To do this you first need to route the output and then to designate a file to which the new dataset (including both the old and new variables) will be saved. If we wanted to save our new dataset to a file called itpnew.rec we would type

route itpnew.rec

Again, it is important that you put in the full path for the file, e.g. a:\itpnew.rec

And then to write the data to that file

write recfile

To leave Epi-Info, press F10 to leave Analysis and return to the main Epi-Info menu, and then press F10 again (or select Quit) to leave Epi-Info.

Analysis of Quantitative variables

Introduction

Here, we are going to use Epi-info to analyse data in the form of continuous variables, i.e. quantitative variables measured on a continuous scale. We shall use the means command to compare continuous variables classified by categorical variables. We shall also see how two continuous variables can be compared using scatter and regress.

Using the Means Command

In this section we use the ungrouped HEIGHT variable, and ask the hypothetical question of whether a child’s height varies according to its sex and weight.

One of the advantages of using statistical packages is that it is easy to examine the data visually before proceeding to formal statistical analysis. This is one way of checking whether the assumptions made in the analysis are reasonable. We can examine a scatter plot of the data

scatter sex height

Exercise

Execute the above command. Note that the first variable is put on the x-axis and the second on the y-axis.

Guess the mean height for each sex

Mean height for sex = 1
Mean height for sex = 2

Does this suggest an association between height and sex ?

Are there any outlying observations ?

These graphs are not really suitable for presenting the data, since it is difficult to discern the distribution of height where the points are crowded together. An alternative graphical presentation to illustrate the variation in height according to sex is to use histograms.

Exercise:

Type the following:

let hgtrp = height div 10

histogram hgtrp

This produces a histogram of all the values of height. Epi-info allows us to use subgroups of the data with the select command.

Type:

select sex=1

histogram hgtrp

This shows a histogram of height for males only. Note that the result of the select command is shown at the top left of the screen as Criteria: sex=1

Exercise:

Plot a histogram for the heights of females, is there any difference ?

What happens if we forget to type select on its own (to cancel the current criteria) before typing select sex=2 ?

(remember to type select again before the next section!)

Recall that we used the means command to derive summary statistics for a variable. Remind yourself of the reason for having to create a new variable with a single value to get the means output for a single variable

let groupall = 1

means weight groupall /n

Make sure that you understand the output. Now calculate the statistics for each sex separately

means height sex /n

The first part is the same summary as you have already seen, but subdivided by each level of sex
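The ideas of selecting a subgroup and of summarising by a grouping variable can be sketched in a few lines of Python. This is only an illustration of the logic, using made-up records. It is not how Epi-info is implemented.

```python
from statistics import mean

# Hypothetical records of (sex code, height in cm), for illustration only.
records = [(1, 130), (2, 128), (1, 142), (2, 135), (1, 151), (2, 139)]

# Counterpart of select sex=1: keep only the records meeting the criterion.
males = [h for s, h in records if s == 1]

# Counterpart of grouping by sex: one list of heights per code.
by_sex = {}
for s, h in records:
    by_sex.setdefault(s, []).append(h)

# One mean per level of the grouping variable.
group_means = {s: mean(hs) for s, hs in sorted(by_sex.items())}
for s, m in group_means.items():
    print(f"sex = {s}: n = {len(by_sex[s])}, mean height = {m}")
```

Grouping by a variable that is 1 for every record, as with groupall, simply gives a single group containing all the records.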

Exercise

Do these results compare with what you guessed when you looked at the scatter plot ?

Linear Regression

If we are examining the relationship between two continuous variables, such as height and weight, we might start by drawing a scatter diagram before proceeding to formal statistical tests.

scatter height weight

Exercise :

Does there appear to be a straight line relationship between the two variables ? If so, guess the best straight line. Now estimate the slope of the best straight line as follows:

Pick two points, A and B, towards the ends of your line (A at the bottom, B at the top). Write down the values of height and weight for each point.

At point A:   height =            weight =
At point B:   height =            weight =

The slope of the line is (heightB - heightA)/(weightB - weightA)

What do you calculate the slope to be ?
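The arithmetic for a slope read off two points can be checked with a short Python sketch. Here the slope of height against weight is taken as the change in height divided by the change in weight between the two points. The coordinates of A and B below are hypothetical, chosen only to show the calculation.

```python
def slope(point_a, point_b):
    """Slope of height against weight through two (weight, height) points."""
    weight_a, height_a = point_a
    weight_b, height_b = point_b
    return (height_b - height_a) / (weight_b - weight_a)


# Hypothetical points read off a plot, for illustration only.
a = (20.0, 110.0)  # (weight, height) at point A
b = (50.0, 155.0)  # (weight, height) at point B

guess = slope(a, b)              # (155 - 110) / (50 - 20)
intercept = a[1] - guess * a[0]  # height at which the line meets weight = 0

print(f"height = {intercept} + {guess} x weight")
```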

We can use Epi-info to perform linear regression using the regress command. To use this to perform a linear regression of height on weight type:

regress height weight

Note that the regress command requires the dependent (response) variable first, which goes on the y-axis, followed by the independent (explanatory) variable, which goes on the x-axis. We are given the correlation, together with 95% confidence limits.

Exercise:

What does this correlation tell you about the relationship between height and weight ?

The program then gives us output which tests the null hypothesis that the slope of the line is equal to zero (i.e. no relationship between the two variables). The next part of the output is the estimated regression coefficients. These are estimates of the parameters α and β in the formula:

height = α + β x weight

The estimate of the parameter β is labeled as the β-coefficient for variable weight and is the slope of the line. The estimate of parameter α is labeled as the Y-intercept, i.e. the value of height when weight=0.
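For those curious about where such estimates come from, the following is a minimal sketch in Python of an ordinary least-squares fit of a straight line, with alpha standing for the Y-intercept and beta for the slope. It is assumed, not confirmed here, that regress uses the same least-squares method. The data are hypothetical and the function name fit_line is invented for this sketch.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares fit of y = alpha + beta * x; also returns correlation r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                 # slope
    alpha = mean_y - beta * mean_x   # intercept: the line passes through the means
    r = sxy / sqrt(sxx * syy)        # correlation coefficient
    return alpha, beta, r


# Hypothetical data, for illustration only.
weight = [20, 30, 40, 50]
height = [112, 123, 141, 154]

alpha, beta, r = fit_line(weight, height)
predicted = alpha + beta * 35  # predicted height at weight = 35

print(f"height = {alpha:.2f} + {beta:.2f} x weight   (r = {r:.3f})")
```

A predicted value is obtained by substituting a weight into the fitted equation, as in the last line of the sketch.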

Exercise:

What is the equation of the fitted line ? How does it compare to the estimate that you previously calculated ?

Using the equation of the Epi-info fitted line, calculate the following

the predicted value of height for weight = 120
the predicted value of height for weight = 170

How do these values compare with what you would have got using your original equation ?

Note the value of the Y-intercept, which is the predicted height in the absurd case that weight=0. This apparently ridiculous result arises because the relationship between height and weight is not linear over the entire range of possible weights, although a straight line does look to be a reasonable approximation over the range we are examining. One of the reasons for checking the data graphically is to check whether the relationship might be linear, or whether a curve might be a better description.