Using Descriptive Statistics in Biology

Introduction to Descriptive Statistics

Scientists typically collect data on a sample of a population and use these data to draw conclusions or make inferences about the entire population. Descriptive statistics allow you to describe and quantify differences among data sets. Descriptive statistics such as the mean, median, mode, and range can help to highlight trends or patterns in the data. Each of these statistics is appropriate to certain types of data or distributions, e.g. a mean is not appropriate for data with a skewed distribution. Frequency graphs are useful for indicating the distribution of data. Standard deviation and standard error are statistics used to quantify the amount of spread in the data and evaluate the reliability of estimates of the true (population) mean.

Variation in Data

Whether they are obtained from observation or experiments, most biological data show variability. In a set of data values, it is useful to know the value about which most of the data are grouped; the center value. This value can be the mean, median or mode depending on the type of variable involved. The main purpose of these statistics is to summarize important trends in your data and to provide the basis for statistical analyses.

Statistic / Definition and Use / Method of Calculation
Mean / The average of all data entries. A measure of central tendency for normally distributed data. / Add up all the data entries, then divide by the total number of data entries.
Median / The middle value when data entries are placed in rank order. A good measure of central tendency for skewed distributions. / Arrange the data in increasing rank order and identify the middle value. For an even number of entries, find the midpoint of the two middle values.
Mode / The most common data value. Suitable for bimodal distributions and qualitative data. / Identify the category with the highest number of data entries using a tally chart or a bar graph.
Range / The difference between the smallest and largest data values. Provides a crude indication of data spread. / Identify the smallest and largest values and find the difference between them.
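
As a minimal sketch, the four statistics in the table can be computed with Python's standard statistics module. The leaf lengths below are invented for illustration and do not come from the handout.

```python
import statistics

# Hypothetical leaf lengths in mm (illustrative values only).
leaf_lengths = [42, 45, 45, 47, 50, 53, 61]

mean_length = statistics.mean(leaf_lengths)          # sum of entries / number of entries
median_length = statistics.median(leaf_lengths)      # middle value in rank order
mode_length = statistics.mode(leaf_lengths)          # most common value
data_range = max(leaf_lengths) - min(leaf_lengths)   # largest minus smallest

print(mean_length, median_length, mode_length, data_range)  # 49 47 45 19
```

In this made-up sample the single large value (61) pulls the mean above the median, which is the kind of skew that makes the median the safer summary.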

Distribution of Data

Variability in continuous data is often displayed as a frequency distribution. A frequency plot will indicate whether the data have a normal distribution (A), with a symmetrical spread of data about the mean, or whether the distribution is skewed (B), or bimodal (C). The shape of the distribution will determine which statistic (mean, median or mode) best describes the central tendency of the sample data.

When to NOT calculate a mean:

  1. Do NOT calculate a mean from values that are already means (averages) themselves.
  2. Do NOT calculate a mean of ratios (e.g. percentages) for several groups of different sizes; go back to the raw values and recalculate.
  3. Do NOT calculate a mean when the measurement scale is not linear (e.g. pH units are not measured on a linear scale).
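
Rule 2 is easiest to see with numbers. The sketch below uses hypothetical germination counts for two seed trays of very different sizes (the counts and variable names are assumptions for illustration) and compares averaging the percentages with recalculating from the raw values.

```python
# Hypothetical germination counts for two seed trays of different sizes.
germinated = [9, 50]    # seeds that germinated in each tray
planted = [10, 100]     # seeds planted in each tray

# Misleading: average the two percentages (90% and 50%) directly.
percentages = [100 * g / p for g, p in zip(germinated, planted)]
mean_of_percentages = sum(percentages) / len(percentages)

# Better: go back to the raw counts and recalculate one overall percentage.
overall_percentage = 100 * sum(germinated) / sum(planted)

print(round(mean_of_percentages, 1), round(overall_percentage, 1))  # 70.0 53.6
```

The small tray counts as much as the large one when the percentages are averaged, which is why the two answers differ.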

Measuring Spread

  • The standard deviation is a frequently used measure of the variability (spread) in a set of data.
  • It is usually presented in the form mean ± standard deviation ($\bar{x} \pm s$). If the mean is 10 and the standard deviation is calculated to be 2, then you would show the data as 10 ± 2.
  • In a normally distributed set of data:
      • 68% of all data values will lie within one standard deviation (s) of the mean ($\bar{x} \pm 1s$).
      • 95% of all data values will lie within two standard deviations of the mean ($\bar{x} \pm 2s$).
  • A large standard deviation indicates that the data have a lot of variability.
  • A small standard deviation indicates that the data are clustered close to the sample mean and have less variability.

In the example above, the mean height of the bean plants was 103 ± 11.7 mm. What does this tell us? In a data set with a large number of measurements that are normally distributed, 68.3% of the measurements are expected to fall within 1 standard deviation of the mean and 95.4% within 2 standard deviations of the mean, on either side. Thus, in this example, if you assume that this sample of 17 observations is drawn from a population of measurements that are normally distributed, 68.3% of the measurements in the population should fall between 91.3 and 114.7 millimeters (103 ± 11.7) and 95.4% of the measurements should fall between 79.6 and 126.4 millimeters (103 ± 23.4).
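
The intervals for the bean plants are simple arithmetic on the mean and standard deviation. The sketch below, which assumes the rounded values 103 mm and 11.7 mm quoted in the text, computes the 1 and 2 standard deviation ranges and, for comparison, the range for 1.96 standard deviations, the multiplier that encloses 95% of a normal distribution.

```python
mean_height = 103.0   # mm, sample mean quoted in the text
sd_height = 11.7      # mm, sample standard deviation quoted in the text

def sd_range(multiplier):
    """Return (lower, upper) limits at `multiplier` standard deviations from the mean."""
    half_width = multiplier * sd_height
    return (round(mean_height - half_width, 1), round(mean_height + half_width, 1))

one_sd = sd_range(1)          # about 68.3% of values
two_sd = sd_range(2)          # about 95.4% of values
ninety_five = sd_range(1.96)  # about 95% of values

print(one_sd, two_sd, ninety_five)
# (91.3, 114.7) (79.6, 126.4) (80.1, 125.9)
```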

We can graph the mean and standard deviation of this sample of bean plants using a bar graph with error bars. Standard deviation bars summarize the variation in the data—the more spread out the individual measurements are, the larger the standard deviation. As sample size increases, standard deviation will become a more accurate estimate of the standard deviation of the population.

Understanding Degrees of Freedom

Calculations of sample estimates, such as the standard deviation and variance, use degrees of freedom instead of sample size. The way you calculate degrees of freedom depends on the statistical method you are using, but for calculating the standard deviation, it is defined as 1 less than the sample size (n-1).

Example: Biologists are interested in variation in leg sizes among grasshoppers. They catch five grasshoppers (n = 5) in a net and prepare to measure the left legs. As the scientists pull grasshoppers one at a time from the net, they have no way of knowing the leg lengths until they measure them all. In other words, all five leg lengths are free to vary within some general range for this particular species. The scientists measure all five leg lengths and then calculate the mean to be $\bar{x} = 10$ mm. They then place the grasshoppers back in the net and decide to pull them out one at a time to measure them again. This time, since the biologists already know the mean to be 10 mm, only the first four measurements are free to vary within a given range. If the first four measurements are 8, 9, 10 and 12 mm, then there is no freedom for the fifth measurement to vary; it has to be 11 mm. Thus, once they know the sample mean, the number of degrees of freedom is 1 less than the sample size: df = 4.
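
The grasshopper reasoning can be checked with a few lines of arithmetic. This is only a sketch of the idea, using the values from the example; the variable names are illustrative.

```python
n = 5                         # sample size
sample_mean = 10              # mm, already known from the first round of measuring
first_four = [8, 9, 10, 12]   # mm, the four measurements that were free to vary

# The five values must sum to n * mean, so the fifth value is fixed.
fifth = n * sample_mean - sum(first_four)
degrees_of_freedom = n - 1

print(fifth, degrees_of_freedom)  # 11 4
```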

Two different sets of data can have the same mean and range, yet the distribution of data within the range can be quite different.

In both the data sets pictured in the histograms below, 68% of the values lie within the range $\bar{x} \pm 1s$ and 95% of the values lie within $\bar{x} \pm 2s$. However, in B, the data values are more tightly clustered around the mean.

Calculating Standard Deviation:

Set up a table like the one below to easily calculate standard deviation.

Calculating Standard Deviation Example:

Data: 2, 5, 9, 12, 15, 17

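Calculate the mean: (2 + 5 + 9 + 12 + 15 + 17) / 6 = 60 / 6 = 10

Value ($x$) / Deviation ($x - \bar{x}$) / Squared deviation ($(x - \bar{x})^2$) / Result
2 / 2 - 10 / (2 - 10)^2 / 64
5 / 5 - 10 / (5 - 10)^2 / 25
9 / 9 - 10 / (9 - 10)^2 / 1
12 / 12 - 10 / (12 - 10)^2 / 4
15 / 15 - 10 / (15 - 10)^2 / 25
17 / 17 - 10 / (17 - 10)^2 / 49
Sum of squared deviations: 168

Use the value from the table to calculate s:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{168}{6 - 1}} = \sqrt{33.6} \approx 5.8$$

Mean ± standard deviation: 10 ± 5.8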
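
The worked example above can be reproduced step by step in Python. This is a sketch rather than a required method; it follows the table approach (deviations, squared deviations, sum, divide by n - 1) and then cross-checks the result against the standard library's statistics.stdev, which also uses n - 1.

```python
import math
import statistics

data = [2, 5, 9, 12, 15, 17]
n = len(data)

mean = sum(data) / n                                  # 60 / 6 = 10.0
squared_deviations = [(x - mean) ** 2 for x in data]  # 64, 25, 1, 4, 25, 49
sum_of_squares = sum(squared_deviations)              # 168.0

# Sample standard deviation: divide by the degrees of freedom (n - 1).
s = math.sqrt(sum_of_squares / (n - 1))

# Cross-check against the standard library.
assert math.isclose(s, statistics.stdev(data))

print(mean, sum_of_squares, round(s, 1))  # 10.0 168.0 5.8
```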

Reliability of the Mean or Measures of Confidence

You have already seen how to use the standard deviation (s) to quantify the spread or dispersion in your data. The variance ($s^2$) is another such measure of dispersion, but the standard deviation is usually the preferred of these two measures because it is expressed in the original units. Usually you will also want to know how good your sample mean ($\bar{x}$) is as an estimate of the true population mean (µ). This can be indicated by the standard error of the mean (or just standard error, SE). SE is often used as an error measurement simply because it is small, rather than for any good statistical reason. However, it does allow you to calculate the 95% confidence interval (95%CI).

When we measure a particular attribute from a sample of a larger population and calculate a mean for that attribute, we can estimate how close our sample mean (the statistic) is to the true population mean for that attribute (the parameter). For example: if we calculated the mean number of carapace spots from a sample of six ladybird beetles, how reliable is this statistic as an indicator of the mean number of carapace spots in the whole population? We can find out by calculating the 95% confidence interval.

Reliability of the Sample Mean—Standard Error of the Mean

When we take measurements from samples of a large population, we are using those samples as indicators of the trends in the whole population. Therefore, when we calculate a sample mean, it is useful to know how close that value is to the true population mean. This is not merely an academic exercise; it will enable you to make inferences about the aspect of the population in which you are interested. For this reason, statistics based on samples and used to estimate population parameters are called inferential statistics.

Example: Assume that there is a population of a species of anole lizards living on an island in the Caribbean. If you were able to measure the length of the hind limbs of every individual in this population and then calculate the mean, you would know the value of the population mean. However, there are thousands of individuals, so you take a sample of 10 anoles and calculate the mean hind limb length for that sample. Another researcher working on that island might catch another sample of 10 anoles and calculate the mean hind limb length for that sample, and so on.

The sample means of many different samples would be normally distributed. The standard error of the mean (SEM or $SE_{\bar{x}}$) represents the standard deviation of such a distribution and estimates how close the sample mean is to the population mean. The larger each sample, the more closely the sample mean will estimate the population mean, and therefore the smaller the standard error of the mean becomes.
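
A small simulation can make this concrete. The sketch below draws many samples of 10 from a hypothetical, normally distributed population (the population mean of 50 mm and standard deviation of 5 mm are assumptions for illustration, not anole data) and compares the standard deviation of the sample means with the population standard deviation divided by the square root of the sample size.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

POP_MEAN, POP_SD = 50.0, 5.0   # hypothetical hind limb lengths in mm
SAMPLE_SIZE = 10
NUM_SAMPLES = 5000

# Each "researcher" catches 10 lizards and records the sample mean.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

sd_of_means = statistics.stdev(sample_means)   # spread of the sample means
expected_sem = POP_SD / SAMPLE_SIZE ** 0.5     # 5 / sqrt(10), about 1.58

print(round(sd_of_means, 2), round(expected_sem, 2))
```

The two printed values should be close to each other, and increasing SAMPLE_SIZE should shrink both.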

Calculating Standard Error of the Mean

The standard error is simple to calculate and is usually a small value. SE is given by:

$$SE = \frac{s}{\sqrt{n}}$$

where s = standard deviation and n = sample size.

The standard error of the mean tells you that about 68% of the sample means would be within ±1 standard error of the population mean and 95% would be within ±2 standard errors.
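
A minimal sketch of this calculation, assuming the usual formula SE = s / sqrt(n) and reusing the six data values from the standard deviation example:

```python
import math
import statistics

data = [2, 5, 9, 12, 15, 17]

s = statistics.stdev(data)   # sample standard deviation (n - 1 in the denominator)
n = len(data)
se = s / math.sqrt(n)        # standard error of the mean

print(round(s, 2), round(se, 2))  # 5.8 2.37
```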

95% Confidence Interval

Another more precise measure of the uncertainty in the mean is the 95% confidence interval (95%CI). This value is usually written as mean ± 95%CI. A 95% confidence interval tells you that, on average, 95 times out of 100, an interval calculated this way will contain the true population mean.

Once researchers have developed a hypothesis, designed an experiment, collected data and applied a number of descriptive statistics that summarize the data, they can use the standard error to make inferences about how confident they can be that the sample means represent the true population means.

Note about error bars: Many bar graphs include error bars, which may represent standard deviation, SEM or 95% CI. When the bars represent SEM, you know that if you took many samples only about 2/3 of the error bars would include the population mean. This is very different from standard deviation bars which show how much variation there is among individual observations in a sample. When the error bars represent 95% CI in a graph, you know that in about 95% of the cases the error bars include the population mean. If a graph shows error bars that represent SEM, you can estimate the 95% CI by making the bars twice as big—this is a fairly accurate approximation for large sample sizes, but for small samples the 95% CI are actually more than twice as big as the SEMs.
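
The difference between the "twice the SEM" shortcut and a small-sample 95% CI can be sketched as follows. The example reuses the six values from the standard deviation example; the multiplier 2.571 is the two-tailed 95% critical value of Student's t distribution for 5 degrees of freedom, hardcoded here from a standard t table as an assumption rather than computed.

```python
import math
import statistics

data = [2, 5, 9, 12, 15, 17]
n = len(data)
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)

# Two-tailed 95% critical t value for n - 1 = 5 degrees of freedom
# (looked up in a t table; an assumption of this sketch).
T_CRIT_DF5 = 2.571

rough_half_width = 2 * se        # shortcut: error bars twice the SEM
t_half_width = T_CRIT_DF5 * se   # small-sample 95% CI half-width

print(round(rough_half_width, 2), round(t_half_width, 2))          # 4.73 6.08
print(round(mean - t_half_width, 2), round(mean + t_half_width, 2))  # 3.92 16.08
```

With only six observations the t-based interval is noticeably wider than twice the SEM, which matches the caution in the note above.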


Adapted from Strode and Brokaw. HHMI Using Biointeractive Resources to Teach Mathematics and Statistics in Biology.