Statistical Lingo

As the planning committee thought about this statistical training, we thought it would be important to have a list of key statistical concepts that might be useful to you. At the training we may use many terms, and we want our staff to be on the same page coming into the program. Below you will find a list of definitions and key concepts that we think are important for you to understand as you use statistics. The definitions below have come from a wide range of sources.

Statistics: A tool to help one process and make sense of large amounts of quantitative information or data collected in research trials, studies of human behavior, or related projects. Statistics is to data as grammar is to words. (Ellen Fireman)

Population: The entire potential group about which you want to make a statement. If you are interested in knowing what potatoes Maine people want to eat, the population is all Maine people who eat potatoes. You will rarely, if ever, have all information about a population. We sample that population and make inferences about the total population from the sample data.

Sample: A set of measurements that constitutes part of the population of interest. The larger the sample, generally the greater the confidence that you will have accurately described the population. A random sample is one in which any individual measurement is as likely to be included as any other.

Sample Number: This is the number of observations that make up a set of data.

When you see a capital N, that signifies the whole population number; a lower case n signifies the number in your sample. Example: if you were working with 4-H youth in Maine … the N is the total number of 4-H youth; the n is your subsample of 4-H youth.

Parameters: These are the characteristics of a population. Characteristics include averages, ranges, midpoints, variation about the average, etc.

Normal Distributions: This is also called the Bell-Shaped Curve. A general characteristic or variable of biological populations (humans, plants, etc.) that has more values near the average and fewer toward the ends of the range. It is a distribution that is symmetrical about the mean, median, and mode, with no skewed tails. Consider shoe sizes: if your foot is very small or very large, you have trouble finding shoes. If you are a size 9 or 10 in men's shoes or 7 to 9 in women's, your selection generally improves. If you were to plot the shoe sizes of the women who work in Extension in Maine, the plot would likely take this shape.

Parametric Statistics: Statistics used to describe populations that take on this bell-curve shape (i.e., average, variation, etc.), which is also called a "normal sample distribution."

Non-parametric Statistics: Statistics used to describe data that do not follow this type of bell-curve shape. (for example, microbiological samples, water flow or infiltration rates, etc.) Different statistics are used to describe these populations.

Experiment: A planned organized process to determine if a specific treatment (drug, educational program, ad campaign) has an effect.

Hypothesis: The theory or issue you want to test with your study.

Note: the steps to good experimentation are a) form a hypothesis, b) plan the experiment (to reduce bias), c) take good data, d) interpret the data, e) confirm or reject the hypothesis, f) write up and publish the study.

Treatment: the specific material or concept that you want to test in your study (drug, chemical or organic fertilizer, educational program, curriculum, etc.)

Also, the method of manipulating your samples: what you intend to expose or apply to your samples (a dose of chemical to plants, a dose of blueberry capsule to participants, a concentrated dip to potatoes) in your study to determine any effects.

Observational studies: Studies in which no treatments are applied; observations are made on people, plants, animals, etc., who might have different characteristics. Observational data are generally observed, noted, and recorded.

Note: these are frequently studies done with humans (malnutrition, fetal alcohol syndrome, etc.) where it would not be ethical to apply a treatment.

Controlled Experiments: planned processes to determine if a treated material (human, crop, etc.) is different from an untreated material (your control).

Treated groups – groups receiving the treatment

Control groups – groups not receiving the treatment

Historical controls – subjects from a past study act as controls for a current study

Key concept in all experimentation: the control group should be as similar as possible to the treated groups, so that any real differences found can be attributed to the treatment effect.

Experimental Unit: the unit of experimental material to which a treatment is applied: may be a person, leaf, individual plot of corn, … whatever.

Experimental Error: Variability (natural, genetic, environmental, soil type) among experimental units that cannot be controlled by the researcher. We design our studies to limit experimental error as much as possible.

Bias: the tendency for a researcher's preconceived notions about a treatment to influence an experimental outcome.

Subject bias: important in human studies and often called the placebo effect; subjects are influenced positively or negatively by the expectation of treatment.

Evaluator bias: a potential influence by the person evaluating the treatment. Randomized code labeling of plots helps control this source of bias.

Blind: not allowing the subjects to know whether they are taking the treatment or a placebo. A double-blind design, in which neither the subjects nor the evaluators know who receives the treatment, helps eliminate both sources of bias.

Variable: a measurable characteristic of an experimental unit.

Quantitative variables: measured numerically. Size of human feet, yield of corn, etc. One can make frequency distributions of these data. These types of data represent a measured quantity.

Categorical variables: Variables that take on names or labels; some examples would be the breed of a dog (collie, shepherd, husky) or the color of a ball (red, green, yellow).

Quantitative variables can be further broken down into two more types of variables:

Discrete variables: take on specific, countable values; each observation must be a single whole number (e.g., the number of times heads came up first in a series of coin tosses, the number of treatments, or the number of dogs owned).

Continuous variables: can take on any value within a range. In a study of female college students' weight, for example, you might have previously specified weight ranges (120-130 lbs, 130-140 lbs, etc.); a student's weight could fall anywhere within one of those ranges. Other examples include the weight of grain, years of ownership of a home, etc.

Qualitative variables: not measured numerically; examples include eye color, social security numbers, and much survey information (agree, strongly agree, etc.).

Independent variable: the treatment level or factor that you set when conducting a trial (e.g., the rate of nitrogen applied in a corn trial).

Dependent variable: this is the information (data) that you collect from the trial … this is dependent on the treatment level applied.

Measures of distributions:

These are three common measures used to describe data in statistics: Mean/Average, Median, and Mode (a short code sketch pulling all three together follows the mode example below):

Average: to find the average, sum your data set (the numbers you have) and divide by the total number of samples. Ex.: 3, 7, 5, 15, 2, 8, 10. Sum = 50; n (number of samples) = 7; 50 divided by 7 = 7.1, which is your average.

Mean: the same definition as average. The population mean is given the Greek symbol mu (μ); the sample mean is a statistic, the estimate of the population mean drawn from a sample, and is given the symbol x̄ (x-bar).

Median: the midpoint of a series of data: rank the numbers in order and find the middle one. 2, 3, 5, 7, 8, 10, 15: the median is 7.

Note: if median and mean are similar, data are likely parametric, or normally distributed

Mode: The mode is the number repeated most often in your data set. If there are no repeating numbers, then there is no mode for the data set.

Ex. Data set: 3, 5, 10, 2, 3, 6, 15, 3. The mode would be 3 because it is the number repeated the most. Mode = 3
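The worked examples above can be reproduced with Python's built-in statistics module; here is a minimal sketch using the same data sets as the text:

```python
# Minimal sketch reproducing the mean, median, and mode examples above,
# using Python's built-in statistics module and the data sets from the text.
import statistics

data = [3, 7, 5, 15, 2, 8, 10]            # mean example: sum = 50, n = 7
print(statistics.mean(data))               # 7.142857... (rounds to 7.1)

ranked = [2, 3, 5, 7, 8, 10, 15]           # median example (already ranked)
print(statistics.median(ranked))           # 7

mode_data = [3, 5, 10, 2, 3, 6, 15, 3]     # mode example
print(statistics.mode(mode_data))          # 3 (repeated most often)
```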

Variance or standard deviation: terms that measure the spread (or variability) around the mean of a specific set of data. Generally speaking, the lower the variance or standard deviation around a mean, the more confidence we have that the sample mean is a real or valid number. A variance of 0.5 means that the numbers that make up the mean are tightly distributed around that mean; a variance of 2 suggests more variability. The symbol for the standard deviation is σ; for the variance it is σ².

Variance is the square of the standard deviation.

Standard deviation around a population mean is defined by this equation:

$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

[x is a value from the population, μ is the mean of all x, N is the number of values in the population, Σ indicates summation]

The estimate of the population standard deviation calculated from a random sample is:

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

[x is an observation from the sample, x̄ is the sample mean, n is the sample size, Σ indicates summation]
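As a sanity check, here is a minimal Python sketch of the two formulas above, reusing the small data set from the averaging example; the statistics module provides the same calculations as pstdev (divide by N) and stdev (divide by n − 1).

```python
# Minimal sketch of the two standard deviation formulas above, reusing the
# small data set from the averaging example. statistics.pstdev divides by N
# (population formula); statistics.stdev divides by n - 1 (sample estimate).
import math
import statistics

data = [3, 7, 5, 15, 2, 8, 10]
mean = statistics.mean(data)

pop_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
sample_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (len(data) - 1))

print(pop_sd, statistics.pstdev(data))     # the two population values agree
print(sample_sd, statistics.stdev(data))   # the two sample values agree
```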

Standard error of the mean or standard error:

The standard error is an important statistic for determining whether the mean of one sample is really different from another. To calculate the standard error, divide the variance of the data set by the number of values that went into the mean, then take the square root; equivalently, divide the standard deviation by the square root of n ($SE = s/\sqrt{n}$). If you teach a program and you have 14 pre-test and 14 post-test scores, the standard error for each mean is the standard deviation of those scores divided by $\sqrt{14}$. The symbol for the standard error is $s_{\bar{y}}$.
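A minimal sketch of this calculation in Python; the 14 post-test scores below are hypothetical, invented only to illustrate the arithmetic:

```python
# Minimal sketch of the standard error calculation. The 14 post-test scores
# are hypothetical, invented only to illustrate the arithmetic.
import math
import statistics

scores = [72, 85, 90, 68, 77, 81, 95, 88, 73, 79, 84, 91, 70, 86]  # n = 14
s = statistics.stdev(scores)               # sample standard deviation
se = s / math.sqrt(len(scores))            # SE = s / sqrt(n)
print(round(se, 2))
```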

Coefficient of Variation

There are times when you want to compare the variability of measurements that are on different scales. You might have taken participants' weights and waist sizes before and after an exercise program. Loss in weight and change in waist size are numerically quite different: after the program, the average weight loss might be 17.5 lbs, while the average waist size change might be 3.2 inches. Using the standard deviations of these data sets, we can state whether weight loss was more variable than change in waist size by using the coefficient of variation (CV). It is calculated by dividing the standard deviation by the mean and multiplying by 100 to put it on a percentage basis.

                      Weight loss    Waist size
Sample mean           17.5 lbs       3.2 in
Standard deviation     3.3            1.5
CV                    18.9%          46.9%
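A minimal sketch of the CV calculation, reproducing the summary numbers in the table above:

```python
# Minimal sketch of the coefficient of variation, reproducing the summary
# numbers in the table above.
def cv(std_dev, mean):
    """Coefficient of variation, expressed as a percentage."""
    return std_dev / mean * 100

print(round(cv(3.3, 17.5), 1))   # weight loss: 18.9%
print(round(cv(1.5, 3.2), 1))    # waist size: 46.9%
```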

These statistics may be explained by looking at the people who attend. Some people concentrate weight gain in their waist (for example, beer-drinking, stock-car-racing male Southerners). If your program had some of these participants, the change in waist size would likely be a more variable measurement, because those individuals had a much greater potential for change once they could no longer raid the refrigerator for beer.

CVs are useful for getting a sense of the general variability of a given test procedure or measurement. A test with an average CV of 10% is likely a fairly precise test. In field research, we can use a CV to see how well we controlled variability in one location compared to another. It is a useful measure.

For Statistics Junkies … these are important but difficult theoretical concepts:
Degrees of Freedom: This is an important and difficult concept in statistics. It is like an accounting procedure. When you calculate the sample mean for a set of data and then use that statistic to calculate a second statistic (the standard deviation), you have to divide by n − 1 instead of n because you have already used the data to calculate the mean. Another way to see this is the following example: imagine you have four numbers (2, 3, 4, and 5) that must add up to a total of 14. You are free to choose the first three numbers at random, but the fourth must be chosen so that it makes the total equal to 14. With four numbers, you have three degrees of freedom. Likewise, when you have four treatments in an experiment, you have three degrees of freedom associated with the treatments, and those are used in all calculations of error. The sketch below illustrates the idea.
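```python
# Minimal sketch of the degrees-of-freedom idea: once the total (and thus
# the mean) is fixed, only n - 1 of the values are free to vary.
total = 14
free_choices = [2, 3, 4]                   # the first three numbers are free
last = total - sum(free_choices)           # the fourth is forced to be 5
print(free_choices + [last])               # [2, 3, 4, 5]: 3 degrees of freedom
```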
Z scores:
Another way to normalize a set of data is to calculate a Z score, which expresses each observation as its distance from the mean in standard deviation units:

$z = \dfrac{x - \mu}{\sigma}$

(for a sample, use $\bar{x}$ and $s$ in place of $\mu$ and $\sigma$).

With normally distributed data, about 95% of the data fall within 2 standard deviations of the mean, and 99.7% fall within 3 standard deviations.
This has practical value for someone evaluating whether a single data point should be thrown out. Generally, if a data point is more than 3 standard deviations away from the average, it is considered an outlier and is safe to eliminate, particularly if there are good reasons to do so. For example, suppose 14 people in a class take a pre-test and a post-test. If one or two participants attended only 10% of the class, their post-test scores would likely be far below the post-test average. If such a score is more than 3 standard deviations out, it would be a logical one to throw out, as in the sketch below.
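A minimal Python sketch of this screening rule, assuming a hypothetical set of 14 post-test scores in which one participant barely attended (all values invented for illustration):

```python
# Minimal sketch of screening for outliers with z scores. The 14 post-test
# scores are hypothetical; the 35 represents a participant who barely
# attended the class.
import statistics

scores = [82, 85, 90, 78, 88, 84, 95, 87, 83, 89, 91, 86, 80, 35]
mean = statistics.mean(scores)
s = statistics.stdev(scores)

for x in scores:
    z = (x - mean) / s                     # distance from the mean in SD units
    if abs(z) > 3:
        print(f"{x} looks like an outlier (z = {z:.1f})")
```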