Intro to Data, Data Analysis, and Stats Notes

I. What Scientists Do

Scientists are people with great curiosity; they ask questions about the physical and natural world and universe in an effort to understand how it all ticks. Most often this occurs through hypothetico-deductive reasoning: questions of interest are posed as testable hypotheses and then examined by experiments (or other types of scientific inquiry) designed to refine the working hypothesis, going from a general hypothesis to ever more specific hypotheses. Scientists have developed many ways to analyze collections of observations ("data") relevant to specific hypotheses so as to objectively test the validity of a hypothesis and to reject those determined to be incorrect. This process is called data analysis.

Today we will begin study of quantitative data analysis. Quantitative data analysis employs a body of mathematics called statistics.

II. What are data? (1 obs = datum [singular]; ≥ 2 = data [pl.])

Data are numeric observations that scientists and social scientists obtain to help answer questions that have been posed as hypotheses or to help describe phenomena.

• Examples: hair color frequency, # of live births per week, frequency of HIV/AIDS among drug users in major US cities, femur length, CTR, test scores, height, weight, age, etc.

• Data may not always be numeric – observations, descriptions, photos, etc.

• In any study, the data to be collected must be clearly defined.

III. Population vs. Sample

A. Statistical Population: ALL of the possible observations one could collect for a single variable; examples:

• for the CTR experiment, ALL mussels in Maine (or the northeast coast)

• ALL 40-50 YO women with a specific form of cancer in the US

• All white pine trees in N. America

Taking data from an entire statistical population is usually impossible. SO, we do what is practical and collect a representative, or random, SAMPLE of observations from that population, and from that sample we make inferences about the whole population.

B. SAMPLE: A representative subset of all the possible observations one could collect from the stat. population.

• Must be collected randomly as far as is possible. What does random mean?

• Must be independent – the collection or measurement of one observation should not affect the value of any other. (We used only one gill from each mussel.)

• Must have a large enough sample size (or replication) to be truly representative and not be misled by chance variation – this is why we say "5 is good, but 8 would be better, 10 or more excellent."

• "Good" sample sizes depend on the study.

• For ecology and other field studies, it is often easy to collect hundreds of measurements quickly.

• For lab experiments, sometimes 5 replications is terrific and perhaps all that is feasible.

For example, if we want to determine the typical stem diameter for pine tree trunks in a certain forest, we could go out to that forest and measure the diameter of every tree in the forest. = entire BIOLOGICAL POPULATION (of size N)

• But is it really feasible to measure EVERY tree? Not if it's a large population with several hundred (or more likely thousands of) individuals!

• Instead, because of limitations on our time and the number of individuals we can measure, we SAMPLE.

• We measure a subset of our population and choose our sample so that it is representative of the entire population of interest (i.e., we want to know what the typical DBH is for trees in this population).

• Sample size (or replication) is denoted as "n" (always identify the replicate), e.g.:

• n = 24 trees

• n = 5 gills per group

• n = 3000 40-50 YO Chinese females with pancreatic cancer

Because a sample is usually vastly smaller than the target population, techniques based in probability have been developed to quantitatively describe and evaluate the inferential power of the sample(s).

• the larger the sample, the better and more reliable the inferences we can make about the statistical population.

- STATISTICS enables us to use information from samples to make inferences about entire populations and to be as objective as possible when drawing conclusions about these populations of interest.
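A minimal sketch of this idea in Python (the "population" of pine tree diameters below is simulated, and all the numbers are made up purely for illustration): we draw random samples of increasing size and watch the sample mean settle toward the true population mean.

```python
import random
import statistics

random.seed(1)

# A made-up "statistical population": diameters (cm) of 10,000 pine trees.
population = [random.gauss(30, 8) for _ in range(10_000)]
print(f"population mean = {statistics.mean(population):.2f} cm")

# Random samples of increasing size: larger samples give more reliable estimates.
for n in (5, 25, 100, 1000):
    sample = random.sample(population, n)
    print(f"n = {n:4d}: sample mean = {statistics.mean(sample):.2f} cm")
```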

C. Parameters vs Statistics

- Another thing we need to know: if we're calculating some statistic about our population, are we calculating it for the entire population (i.e., every individual in that population) or based only on a sample of the population?

- if we're calc. some value, such as a mean or variance, based on measurements of every individual in the population, then we call that value a PARAMETER. A parameter characterizes the whole population. POP DESCRIPTOR = PARAMETER

- but if we're calc. these values based on a sample or subset of the population, then we are estimating that parameter and we call this a STATISTIC. Remember, we're using this statistic to draw conclusions about the entire population, based only on a representative subset or sample of the population. SAMPLE DESCRIPTOR = STATISTIC

- Usually parameters are represented by Greek letters while statistics are represented by Latin letters (e.g., µ and σ² for the population mean and variance vs. x̄ and s² for the sample mean and variance).

IV. Descriptive (summary) statistics

So, we have a sample. What does it do for us? As a collection of observations (original data), it can't do much in most cases. We therefore use mathematical approaches to describe, or summarize, the attributes of the sample and thereby make estimates of the true population parameters.


A. The Normal Distribution

Many kinds of data, when plotted as a frequency distribution, will follow a pattern called a “normal” distribution. (commonly called the “bell-shaped” curve)

[Figure: "A Normal Distribution" – frequency histogram; x-axis: Measurement Category or "bin"; y-axis: Counts.]

For example, heights of women students at Bates in 2004/05. Most women’s heights will cluster together over a small range of values. At either extreme you get those shorter women (rugby team) and taller women (basketball and volleyball teams). The extremes tend to be rarer.
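As a rough illustration (not the Bates data; the heights below are simulated), a frequency distribution can be built by tallying observations into bins. With real data, plotting these counts gives the bell-shaped curve.

```python
import random
from collections import Counter

random.seed(2)

# Simulated heights (cm); real data would come from your sample.
heights = [random.gauss(165, 7) for _ in range(200)]

# Tally observations into 5-cm bins to build a frequency distribution.
bin_width = 5
counts = Counter(int(h // bin_width) * bin_width for h in heights)

# Print a crude text histogram: most heights cluster near the middle bins.
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width} cm: {'*' * counts[lower]}")
```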

B. Measures of central tendency

- if we want to know the typical female height in our population of interest, how could we estimate this?

- RIGHT! We could collect a random sample of women's heights and calculate the MEAN, or average, for the sample; this is an estimate of the true mean height of the whole population.

• How is the mean calculated? The mean is calculated simply by summing up all the measurements and dividing by the number of measurements we have (give formula on board):

mean = (Σ xi) / n,

where xi = any given datum, and n = sample size.
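In Python the calculation looks like the following (the gill transport rates here are made-up numbers, just to show the arithmetic):

```python
# Hypothetical ciliary transport rates (µm/sec) from one group of gills.
data = [310.2, 402.7, 355.0, 298.4, 360.9]

# mean = (sum of all observations) / (number of observations)
mean = sum(data) / len(data)
print(f"n = {len(data)}, mean = {mean:.1f} µm/sec")
```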

• We call the mean a measure of CENTRAL TENDENCY; it tells us something about the distribution of data values in the population.

• The mean is the most commonly used measure of central tendency (with normal distributions), but there are others that are sometimes more appropriate to use:

o the MEDIAN (50th percentile obs)

o MODE (most frequent obs).

o PERCENTILES (25th, 75th)

o these are used when frequency distributions of observations don't fit a normal dist. The median is often better when the distribution is skewed (asymmetrical).
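A quick sketch of these other measures using Python's standard statistics module (the data set is invented, and note that percentiles can be computed in several slightly different ways):

```python
import statistics

# A small, right-skewed made-up data set (e.g., counts per quadrat).
data = [2, 3, 3, 4, 4, 4, 5, 6, 9, 21]

print("mean   =", statistics.mean(data))    # pulled upward by the outlier (21)
print("median =", statistics.median(data))  # 50th percentile; resists the outlier
print("mode   =", statistics.mode(data))    # most frequent observation

# 25th and 75th percentiles (quartiles); n=4 splits the data into four groups.
q1, q2, q3 = statistics.quantiles(data, n=4)
print("25th percentile =", q1, " 75th percentile =", q3)
```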

C. Measures of variation (variance and standard deviation)

Good as the mean is, it is not sufficient unto itself as a sample descriptor.

Consider two samples having exactly the same mean. Note that the shapes (widths) of the frequency distributions are quite different (one fatter). WHY?

• Example of two manufacturing process quality control tests. Which one would be better way to go for product consistency? Why?

RIGHT! The skinnier data distribution has measurements that are much more similar to each other than the other (= a more consistent product). We call that variability among the observations in a data set the SAMPLE VARIANCE.

Variance Calculation: Essentially it is the average squared difference of each observation from the mean of the sample.

- We can use the following formula to calc. the sample variance (s²):

sample variance (s²) = Σ (xᵢ – mean)² / (n – 1)

We square the differences in order to remove the sign (-) from all values; n – 1 is used (vs. n) to get an unbiased estimate. We err on the side of making the estimate slightly bigger, rather than underestimating it. You will see this approach used many times in statistics.

-Because we're dealing with the sum of squared deviations from the mean, this means that when we calc. the variance, the value we get is in units squared. This isn't particularly helpful because it's tough to think in squared units.

- if we take the square root of the variance, then we'll be back to the units we started with (say cm, instead of cm2). This square root of the variance is called the STANDARD DEVIATION; this is an estimate of the variability around the mean.

Standard deviation (s) = √(s²)

Note that the sample variance (or SD) will be reduced by increasing the sample size up to a point where it eventually will level off.

• Report variability around the mean by stating the mean, plus or minus the standard deviation. Don't forget units!

e.g., 345.5 ± 51.2 µm/sec
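A minimal sketch of the variance and standard deviation calculation and the "mean ± SD" reporting format (again with made-up transport rates); Python's statistics.variance and statistics.stdev use the same n – 1 formula, so they are included as a check:

```python
import math
import statistics

# Hypothetical ciliary transport rates (µm/sec) for one group of gills.
data = [310.2, 402.7, 355.0, 298.4, 360.9]
n = len(data)
mean = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
sd = math.sqrt(variance)  # standard deviation = square root of the variance

print(f"mean ± SD = {mean:.1f} ± {sd:.1f} µm/sec (n = {n})")

# The standard library gives the same answers (it also divides by n - 1).
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(sd, statistics.stdev(data))
```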

- now we have a way of describing characteristics of a population sample - a measure of central tendency and a measure of variability.

- Useful facts: in a normal distribution, the mean ± 1 SD encompasses 68% of the observations; mean ± 2 SD encompasses 95% of the observations. Knowing the mode, the mean ±SD, and the range, you can roughly draw the distribution of a data set.
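These percentages can be checked with a quick simulation (a sketch using randomly generated normal data rather than real measurements):

```python
import random
import statistics

random.seed(3)

# 100,000 random draws from a normal distribution (mean 0, SD 1).
data = [random.gauss(0, 1) for _ in range(100_000)]
mean = statistics.mean(data)
sd = statistics.stdev(data)

within_1sd = sum(abs(x - mean) <= 1 * sd for x in data) / len(data)
within_2sd = sum(abs(x - mean) <= 2 * sd for x in data) / len(data)

print(f"within mean ± 1 SD: {within_1sd:.1%}")  # about 68%
print(f"within mean ± 2 SD: {within_2sd:.1%}")  # about 95%
```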

• When reporting descriptive stats, always provide a measure of central tendency, a measure of variability, n, and the range (min and max values).

• In figures, the legend should always clearly identify the summary value (measure of central tendency), the measure of variability used, and the sample size (n) for each set of data plotted. For example:

Figure 1. Mean ciliary transport rate (± 1 SD) of……..… n = 10 gills per group.

• In tables, we usually do this in the column headings. See the example on the next page.

Sneak peek at INFERENTIAL STATISTICS

The reason for using statistics is to help us objectively decide whether or not our testable hypotheses should be rejected, given the outcome of our experiments. If we simply look at the data and pronounce a judgement that there was or wasn't an effect, we could (probably would) allow BIAS to enter our analysis, and our conclusions would be difficult to substantiate.

Inferential statistical tests are tools that quantitatively assess the differences among groups of data or the strengths of associations and yield a numeric measure of how big the difference is or how strong the association is. Based upon that value we can objectively determine the probability of seeing that difference or association in a sample of our size just due to chance. If the probability is small, then we have a much better basis for concluding that we have a real effect or association.

In the mussel lab you will cheat a little and use the "SD overlap" rubric as a pseudo-inferential, objective means of assessing differences among the groups. This technique is only to be used here and now, or if you are trapped on a desert island (in Survivor 23) and you need a stats test. (A rough sketch of the check follows the list below.)

• Most reliable with larger sample size (n > 4-5 replicates per group)

• If variability is large, less reliable.

• With larger sample sizes (n > 9-10), some overlap may be OK and still claim differences.
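A rough sketch of how the "SD overlap" check might be applied in Python. The data are hypothetical, and the interpretation assumes that "overlap" means the mean ± 1 SD intervals of the two groups intersect; check the lab handout for the exact rubric you are expected to use.

```python
import statistics

# Hypothetical ciliary transport rates (µm/sec) for two treatment groups.
control = [310.2, 402.7, 355.0, 298.4, 360.9, 340.1, 371.5, 325.8]
treated = [201.4, 188.9, 240.3, 215.7, 199.0, 228.6, 210.2, 195.5]

def mean_sd_interval(data):
    """Return (mean - 1 SD, mean + 1 SD) for a sample."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return m - s, m + s

lo1, hi1 = mean_sd_interval(control)
lo2, hi2 = mean_sd_interval(treated)
print(f"control: {lo1:.1f} to {hi1:.1f} µm/sec")
print(f"treated: {lo2:.1f} to {hi2:.1f} µm/sec")

# If the two mean ± 1 SD intervals do not overlap, tentatively call the groups different.
overlap = (lo1 <= hi2) and (lo2 <= hi1)
print("intervals overlap" if overlap else "intervals do not overlap")
```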