Statistics: In a Nutshell

A statistic is a number that is calculated to represent some characteristic of a group of numbers. The goal of statistics is to use these summary numbers to get a sense of a group of numbers, that is, to build a model of the group of numbers.

There are two kinds of statistics. One kind, called descriptive statistics, involves calculating statistics that are designed to tell us something about the specific group of numbers that we have actually collected.

A second kind, called inferential statistics, consists of statistics that we use to try to get some idea of what the parameters of a population are like. We are trying to find out about a large group, even though we don't actually have direct information about that large group.

A parameter is a characteristic of a population that gives us a sense of some quality of the population. Parameters are analogous to statistics, except that parameters relate to populations and statistics relate to samples. (If you don't know what the word "analogous" means, or if you come across any word you don't understand, you should ask someone or look it up in a dictionary!)

In this class, we will begin with descriptive statistics, developing tools to get a simpler and simpler sense of a large group of numbers. At each stage, we will simplify the information that is used to describe a set of numbers. When we do this, we make it easier for our minds to understand what the numbers are telling us, but we always lose some information, too.

Here is an outline of all the major material we will cover in this class:

Descriptive Statistics

Imagine a bunch of numbers

If we just have a whole bunch of numbers, we don't get much information. It is hard to tell what is really going on. Suppose we ask 30 students their score on the last exam they took, and these are the responses we get:

75 90 85 100 85 80 90 100 75 100

90 60 80 95 75 90 75 90 85 100

85 70 75 95 75 90 85 85 100 90

What is the general performance of these students? What sense of their academic abilities can we get from just looking at the data in this way?

1/09

We can get a better sense if we can organize the data into a frequency distribution.

A frequency distribution groups the numbers that are the same together, and lists them together with the number of times they occur in the group of numbers. For the exam score data, a frequency distribution looks like this:

X / Frequency of X
60 / 1
65 / 0
70 / 1
75 / 6
80 / 2
85 / 6
90 / 7
95 / 2
100 / 5

From this table, we can see that people generally did fairly well on their last exam; not many people scored below 75. More people seemed to score in the 80s or 90s, and 5 people got a perfect score of 100%.
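In class we will use SPSS for this kind of work, but a frequency distribution is simple enough to build by hand. Here is a sketch in Python using the 30 exam scores from the text:

```python
from collections import Counter

# The 30 exam scores listed in the text
scores = [75, 90, 85, 100, 85, 80, 90, 100, 75, 100,
          90, 60, 80, 95, 75, 90, 75, 90, 85, 100,
          85, 70, 75, 95, 75, 90, 85, 85, 100, 90]

# Count how many times each distinct score occurs
freq = Counter(scores)

# Print the frequency table, lowest score first
for x in sorted(freq):
    print(x, freq[x])
```

Running this reproduces the table above, e.g. the score 90 appears 7 times and 100 appears 5 times.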

Sometimes, it also helps if we graph the frequency distribution in a bar graph or histogram, like this:

[Figure: histogram of the exam-score frequency distribution]

Or, we can use a different kind of graph called a frequency polygon, which looks something like this:

[Figure: frequency polygon of the same distribution]

However, we might still want to summarize these data in an even more succinct way. We can do this using what are sometimes called the moments of a distribution. The most common summary statistics are measures of central tendency (or location) and measures of spread (or variability).

  1. Central Tendency refers to a measure of how high or low the scores are, generally. This is an average.

a) The most common measure of central tendency is the mean (this is probably what you think of when you think of an average). The mean for the data above is 85.67.

b) The median is another measure of central tendency — the score that divides the numbers in half, so that 50% of the scores are above the median and 50% are below the median. The median of the data above is 85.

c) The mode is another measure that is sometimes used to indicate central tendency. The mode is the most common or frequent value. What is the mode of the data above?
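As an illustration of how these three measures come out for the exam-score data, here is a sketch using Python's standard `statistics` module (SPSS will give the same values):

```python
import statistics

scores = [75, 90, 85, 100, 85, 80, 90, 100, 75, 100,
          90, 60, 80, 95, 75, 90, 75, 90, 85, 100,
          85, 70, 75, 95, 75, 90, 85, 85, 100, 90]

mean = statistics.mean(scores)      # the arithmetic average: 85.67 (as in the text)
median = statistics.median(scores)  # the middle score: 85 (as in the text)
mode = statistics.mode(scores)      # the most frequent score (try to answer before running!)
```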

  2. Knowing the location or central tendency of a set of numbers really doesn't tell us that much. It tells us how high or how low the numbers are over all, but it doesn't tell us anything more. One thing that is very useful to know is how spread out the numbers are. We call this the spread or the variability.

a) There are three common measures of spread or variability. The first of these is called the range. The range is simply the distance between the highest and lowest score in the distribution. That is, we take the lowest score, and subtract it from the highest score to get the range. What is the range of the distribution given above?

b) A second measure of variability is called the variance. The variance is itself a mean — in this case, the mean squared distance of each score from the overall mean of the distribution. We'll talk much more about this later. The variance of the distribution above is 99.56.

c) An even more useful measure of variability (for reasons we will see later) is the standard deviation. This is the square root of the variance. For the data above, the standard deviation is 9.98.
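The variance of 99.56 given above comes from dividing the sum of squared distances by the number of scores, N (the "population" formula, which is what `pvariance` computes). A sketch checking all three measures against the values in the text:

```python
import statistics

scores = [75, 90, 85, 100, 85, 80, 90, 100, 75, 100,
          90, 60, 80, 95, 75, 90, 75, 90, 85, 100,
          85, 70, 75, 95, 75, 90, 85, 85, 100, 90]

value_range = max(scores) - min(scores)  # highest minus lowest score
variance = statistics.pvariance(scores)  # mean squared distance from the mean: 99.56
std_dev = statistics.pstdev(scores)      # square root of the variance: 9.98
```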

  3. Measures of variability and measures of central tendency are used when we are thinking about only one variable. However, often we want to get a picture of two variables, and how they relate to each other.

a) For example, in addition to asking about exam score, suppose I also asked people their overall grade point average. We would expect there to be a relationship such that people with higher GPAs probably got a higher grade on their last exam. To look at this we would calculate a statistic called a correlation.

b) Another thing we could do is try to come up with an equation where we could predict someone's exam score based on their GPA. We could use our data to create a regression equation. A regression equation would give us a way to make an educated guess about people's exam scores.

(1) Of course, we are not going to be right all the time, but this would be our best guess, based on the data that we have collected.
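The GPA figures below are made up for illustration (the text does not give any). The sketch computes a Pearson correlation and a simple least-squares regression line by hand, then uses the line to make an educated guess about a new student's exam score:

```python
import statistics

# Hypothetical data: each student's GPA paired with their exam score
gpa  = [2.0, 2.5, 3.0, 3.5, 4.0]
exam = [70,  75,  85,  90,  100]

n = len(gpa)
mean_x = statistics.mean(gpa)
mean_y = statistics.mean(exam)

# Covariance: the mean product of the two variables' deviations from their means
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(gpa, exam)) / n

# Pearson correlation: covariance divided by the product of the standard deviations
r = cov / (statistics.pstdev(gpa) * statistics.pstdev(exam))

# Regression equation: predicted exam = slope * GPA + intercept
slope = cov / statistics.pvariance(gpa)
intercept = mean_y - slope * mean_x

# Educated guess for a (hypothetical) student with a GPA of 3.2
predicted = slope * 3.2 + intercept
```

With these made-up numbers the correlation is strongly positive (close to 1), as we would expect if higher GPAs go with higher exam scores.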

  4. Understanding these ways to represent or describe data gives us an ability to get a 'picture' of what some specific set of numbers is like. We can get a sense of the overall data with a frequency distribution — either in a table, a histogram, or a frequency polygon. We can understand some characteristics of those data more succinctly with summary statistics such as measures of location, variability, and shape.
  5. If we have two variables, we get even more information. We can understand how the two variables relate to each other with a correlation coefficient or a regression equation.
  6. However, these summary statistics only apply to the specific numbers that we have in front of us. If we want to get a sense of the parameters of a population — for example, the mean or standard deviation of the whole population, or the difference between the means of two different populations, etc. — then we have to move into inferential statistics.

Inferential Statistics

  1. As mentioned, inferential statistics are statistics used to get a sense of what is going on in the population. Because we only have data from a sample, we have to make an inference about what the values of the population parameters are. Because by definition we do not know things about the population, the inferences we make about the population parameters are always accompanied by uncertainty. Fortunately, we can often make certain assumptions that allow us to tell exactly how uncertain we are!
  2. A very important concept for inferential statistics — perhaps the most important concept in this whole course — is the concept of the sampling distribution.
  3. The sampling distribution of a statistic is a frequency distribution of that statistic. That is, if we imagine that we could take every possible sample from some population, and calculate some statistic on each sample, we would have a bunch of numbers — the value of the statistic for each of the different samples.
  4. We can group these into a frequency distribution, and the shape of this distribution, as well as the moments of the distribution (e.g., its mean and standard deviation), can help us a great deal.
  5. Of course, we can't really collect every possible sample (if we could, we would also be able to get data on the whole population, and then we wouldn't need inferential statistics at all). But, we can often assume that the sampling distribution of a particular statistic is of a certain form — a form that corresponds to some theoretical distribution that we already know about (such as the normal distribution).
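We cannot really take every possible sample, but we can get a feel for a sampling distribution by simulation. The population below is made up for illustration; the sketch draws many random samples, computes the mean of each, and looks at how those sample means are distributed:

```python
import random
import statistics

random.seed(0)  # make the simulation repeatable

# A made-up "population" of 1,000 values (in real research we would
# not have access to all of these)
population = [random.gauss(100, 15) for _ in range(1000)]

# Approximate the sampling distribution of the mean: draw 2,000 samples
# of size 30 and compute the mean of each one
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2000)]

# The sampling distribution is centered near the population mean, and its
# spread (the standard error) is much smaller than the population's spread
center = statistics.mean(sample_means)
spread = statistics.pstdev(sample_means)
```

This is exactly the idea in the outline: each sample mean is one number, and the whole collection of them forms a frequency distribution of the statistic.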
  6. There are basically two approaches to inferential statistics that are commonly used, and we will cover both of them. Of course, they are closely related to each other, but they take somewhat different approaches.
  7. The first approach can be called the estimation approach. With this approach, we try to come up with some sample statistic that will be a good estimate of a population parameter that we are interested in.
  8. Often, the statistic that is the best estimate is closely related to the parameter. For example, the best estimate of the mean of a population will be the mean of the sample.

a) However, it is important to understand that for other parameters — most of which we will not cover in this class — this may not be the case.

  9. When we have an estimate of a parameter, there will be some uncertainty. That is, we can be pretty sure that the population mean is not exactly the same as our sample mean.

a) However, we can come up with a confidence interval, which gives us a range and lets us know that the population parameter is most likely in that range.

(1) For example, we might say that the mean of some population is 12.45, plus or minus 4.2. This means that we are pretty certain (although never 100% certain) that the mean of the population is between 8.25 (= 12.45 - 4.2) and 16.65 (= 12.45 + 4.2).

(2) The narrower the confidence interval, the more exact we can be about the value of the parameter.
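The sample size and standard deviation below are assumptions, chosen so that the margin works out to the 4.2 in the example above; the sketch shows where a 95% interval like "12.45 plus or minus 4.2" can come from, using the large-sample normal critical value 1.96:

```python
import math

# Hypothetical sample (not from the text): 30 observations with this
# sample mean and standard deviation
n = 30
sample_mean = 12.45
sample_sd = 11.74

# Standard error of the mean, and a 95% margin using the normal
# critical value 1.96 (a large-sample approximation)
se = sample_sd / math.sqrt(n)
margin = 1.96 * se                       # works out to about 4.2

lower = sample_mean - margin             # about 8.25
upper = sample_mean + margin             # about 16.65
```

Notice that a bigger sample (larger n) shrinks the standard error, which shrinks the margin and narrows the interval, just as point (2) says.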

  10. The second approach to inferential statistics is called the null hypothesis testing approach. In this approach, we suppose that a population parameter has a certain value. For example, we might adopt the null hypothesis that the population mean is zero. Then, we test whether, if the population parameter were zero, we would be likely to get the data that we actually got in our sample.
  11. For example, we might have a sample with a mean of 12.45 and a standard deviation of 2. And, the question might be: Is the population mean equal to zero?

a) Appropriate statistical tests can be done, and we could say that there is less than a 5% chance that data with a mean of 12.45 and a standard deviation of 2 would come from a population with an actual mean of zero.

b) Since it is therefore very unlikely that the sample we have came from a population with a mean of zero, we can reject the null hypothesis and say that the mean of the population is probably not zero.
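The example in the text does not give the sample size, so n = 16 below is an assumption made for illustration; the sketch shows the logic of the test using a large-sample z statistic:

```python
import math

# The text's example: sample mean 12.45, standard deviation 2; the sample
# size is not given in the text, so n = 16 is an assumption
n = 16
sample_mean = 12.45
sample_sd = 2.0
null_mean = 0.0  # the null hypothesis: the population mean is zero

# How many standard errors is the sample mean from the null value?
se = sample_sd / math.sqrt(n)
z = (sample_mean - null_mean) / se

# Large-sample normal approximation: reject the null at the 5% level
# when |z| exceeds the critical value 1.96
reject_null = abs(z) > 1.96
```

Here the sample mean sits about 25 standard errors above zero, so the data would be wildly unlikely under the null hypothesis, and we reject it, just as in point b) above.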

  12. We will consider estimation/hypothesis testing about a variety of different parameters that are useful in understanding real data in psychology. We will also learn how to use SPSS to get both the descriptive and inferential statistics that we need.

This is basically everything we will be doing in this class. We will cover a few other things along the way that will help us understand some of the issues that are covered in this outline, but just about everything can fit somewhere in this scheme of things. Simple, no?
