Intro to Probability

First, a review of the t-test and what “p” is supposed to mean

What are we doing here?

Next, what do we do with the number that we calculate here?

What does “p<.05” mean? The letter “p” stands for something, what is it? Where do we get the .05?

What are we trying to draw a conclusion about?

Given that conclusion, what are the decisions we can make? How can they be wrong?

Focusing more specifically on probability

Probabilities describe the likelihood of some event occurring. To talk about this meaningfully, we need some new terminology.

p(x) = the probability of event x happening; p(x|y) = the probability of x happening, given that y has already happened.

Sample Space

sample space (Ω) = all possible outcomes

Event = each individual outcome

Each event has some probability of occurring. That probability is equal to its rate in Ω.

Examples of Ω: a coin toss, a die roll, this class, etc. The events in those are, respectively: heads or tails; 1, 2, 3, 4, 5, or 6; one particular person. So, if we tossed 1 coin, 1 time, what is Ω? Heads and tails. What is the probability of each? There are 2 possible outcomes, each occurs just once, so the probability of heads is ½ and the probability of tails is also ½.

intersection

Properties of Probabilities

1.p()=1

2.for any A, ; that is, the probability of any event has to be no less than zero and no more than one.

3.If the probabilities are “disjoint” (mutually exclusive or no cross over between the sets of possible outcomes) then:

which just means that the probability of the union of a 2 or more outcomes is equal to the sum of the probabilities of the those outcomes.

p(heads)=.5 and p(tails) = .5

The union of those sets is said, “the probability of heads or tails,” and is written:

p(heads ∪ tails) = p(heads) + p(tails) = .5 + .5 = 1

Sampling

Suppose there’s a box with five numbered (1 to 5) balls in it. We are going to draw a certain number of them out of the box.

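N = the number of balls in the box (the number of events in the space for a single draw), here 5; n = the sample size, which in this case is the number of draws per trial. On any one draw, each ball has a 1 in N chance of being drawn: 1/N, or 1/5.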

Simple Random Sampling With Replacement

Take one out, put it back. In our example, with a single draw, how many possible ways are there to sample?

5: we can get a 1, 2, 3, 4 or 5.

Say we change the number of draws to 2 (n = 2). Now how many ways are there?

(1,1) (2,1) (3,1) (4,1) (5,1)

(1,2) (2,2) (3,2) (4,2) (5,2)

(1,3) (2,3) (3,3) (4,3) (5,3)

(1,4) (2,4) (3,4) (4,4) (5,4)

(1,5) (2,5) (3,5) (4,5) (5,5)

So, what is Ω? What is an example of an event?

There is a simpler way of finding the total number of possible outcomes. Just raise N to the power n (N^n), where N is the number of events in the space for a single draw and n is the number of draws. Here that is 5^2 = 25.

The previous example assumes that we care about order (that we draw a 4 before we draw a 5). Say we don’t; all we are interested in knowing is the likelihood that we are going to draw a 4 and a 5. How many ways are there to draw a 4 and a 5? 2. And how big is Ω? 25. So the answer is 2/25.
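As a quick check on the counting above, here is a minimal Python sketch that lists every ordered outcome for two draws with replacement and picks out the "a 4 and a 5" event. The variable names (balls, n_draws, omega, event) are just illustrative choices, not anything from the notes.

```python
from itertools import product

balls = [1, 2, 3, 4, 5]   # N = 5 balls in the box
n_draws = 2               # n = 2 draws, with replacement

# The sample space: every ordered pair of draws
omega = list(product(balls, repeat=n_draws))
print(len(omega))               # 25
print(len(balls) ** n_draws)    # 25, the N-to-the-n shortcut

# The event "a 4 and a 5 were drawn", in either order
event = [outcome for outcome in omega if sorted(outcome) == [4, 5]]
print(event)                    # [(4, 5), (5, 4)]
print(len(event), "/", len(omega))   # 2 / 25
```

Listing everything only works for tiny examples like this one, which is exactly why the shortcut matters.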

There is a more formal way to say this, though.

S={4,5}

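Next question: How many outcomes contain the number 2? We can count, but that is inefficient. Instead we add the probabilities of the two events and subtract the overlap, so that nothing is counted twice:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B)

So if A = (2 on first draw) and B = (2 on second draw) then:

p(A ∪ B) = 5/25 + 5/25 − 1/25 = 9/25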
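The "outcomes containing a 2" question can be checked two ways. The sketch below assumes the general addition rule, p(A ∪ B) = p(A) + p(B) − p(A ∩ B), and compares it with a brute-force count over the 25 outcomes. The helper p() and the set names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product

# Two draws with replacement from balls 1..5
omega = list(product(range(1, 6), repeat=2))

a = {o for o in omega if o[0] == 2}   # A: 2 on the first draw
b = {o for o in omega if o[1] == 2}   # B: 2 on the second draw

def p(event):
    """Probability of an event as its share of the sample space."""
    return Fraction(len(event), len(omega))

by_rule = p(a) + p(b) - p(a & b)   # addition rule, overlap subtracted once
by_count = p(a | b)                # direct count of outcomes with a 2

print(by_rule, by_count)           # 9/25 9/25
```

The overlap is the single outcome (2,2), which would otherwise be counted in both A and B.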

Simple Random Sampling Without Replacement

This means that when we take an event out it cannot occur again. So if we take the one ball out, we don’t put it back. Using the example above, assuming we ignore order, there are only 10 possibilities:

(1,2) (1,3) (1,4) (1,5) (2,3) (2,4) (2,5) (3,4) (3,5) (4,5)

So the probability of any combination is 1/10. As you can see, as we get more and more possible events, it’s going to get a lot tougher to figure out the probabilities. Fortunately, there is a formula for doing so:

C(N, n) = N! / (n!(N − n)!)

In the example above, we would write this as:

C(5, 2) = 5! / (2! 3!) = 120 / 12 = 10
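For unordered samples without replacement, the standard counting formula is the binomial coefficient N!/(n!(N − n)!), which is assumed here to be the formula these notes intend. A minimal Python sketch that checks it against a direct listing of the pairs (math.comb needs Python 3.8 or later):

```python
from itertools import combinations
from math import comb, factorial

N, n = 5, 2

# Direct listing: unordered pairs, no ball used twice
pairs = list(combinations(range(1, N + 1), n))
print(pairs)

# The same count from the factorial formula and from the built-in
by_formula = factorial(N) // (factorial(n) * factorial(N - n))
print(len(pairs), by_formula, comb(N, n))   # 10 10 10
```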

Application

Say you wanted to play the lotto. These tend to be pick-6 games of some range of numbers and are sampled with replacement. Let’s say the numbers go 1 to 35.

First, what is N? 35. What is n? 6.

So, the total number of combinations is 35^6 = 1,838,265,625

If the pay out is $1,000,000 this seems like a very poor bet to make.

Which of these numbers is more likely to come up: 120744 or 111111?

Even if it were set up without replacement:

35! / (6! 29!) = 1,623,160

That’s a little better, but I still wouldn’t play.
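The lotto arithmetic is easy to reproduce. This sketch assumes the two setups discussed above: ordered draws with replacement (N to the n) and unordered draws without replacement (the binomial coefficient). The variable names are illustrative only.

```python
from math import comb

N, n = 35, 6   # numbers 1 to 35, pick 6

with_replacement = N ** n          # ordered, numbers can repeat
without_replacement = comb(N, n)   # unordered, no repeats

print(f"{with_replacement:,}")     # 1,838,265,625
print(f"{without_replacement:,}")  # 1,623,160
```

Either way, a single ticket is one outcome out of that many.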

Homework: assignment 1 on the webpage

Getting back to our box of balls. What are the odds of picking any specific number on the first draw? After that draw is completed, what are the odds of picking any specific number on the second draw? The probabilities change depending on the first.

Say we draw a five on the first draw. What are the odds of drawing a 4 on the next? What were the odds of drawing the (4,5) combination earlier?

Again, say we draw a five on the first draw, what are the odds of drawing a five on the second?

So when the order of draws is important, things begin to change. What are the odds of drawing a 2 followed by drawing a 3?

p(2 then 3) or more properly, p(3|2)

That is, what is the probability of drawing a 3, given that (or after) you have already drawn a 2?

In other words, as long as the events are independent, it’s simply equal to the probability of the second event adjusted for the change in Ω. With the 2 removed, four balls remain, so p(3|2) = 1/4.

If you have independent sampling with replacement, p(A|B) = p(A). Example: a coin toss.
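To see how the second draw depends on the sampling scheme, here is a minimal sketch that computes p(3|2) by counting, once without replacement and once with. The function name p_second_given_first is hypothetical, made up for this illustration.

```python
from fractions import Fraction
from itertools import permutations, product

balls = range(1, 6)

def p_second_given_first(omega, first, second):
    """Share of outcomes with `second` on draw 2, among those with `first` on draw 1."""
    given = [o for o in omega if o[0] == first]
    both = [o for o in given if o[1] == second]
    return Fraction(len(both), len(given))

no_replacement = list(permutations(balls, 2))      # 20 ordered pairs, no repeats
with_replacement = list(product(balls, repeat=2))  # 25 ordered pairs

print(p_second_given_first(no_replacement, 2, 3))    # 1/4
print(p_second_given_first(with_replacement, 2, 3))  # 1/5, the same as p(3)
```

With replacement the first draw tells us nothing about the second; without replacement the sample space shrinks from five balls to four.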

Where some people get confused about joint probabilities.

We have been talking about the likelihood of a single event in the presence of a previous event. Earlier we were talking about the likelihood of a series of events. These are not the same thing.

p(THTH) is not the same as p(H|THT)

What are the odds of each of the above?

p(THTH) = .5*.5*.5*.5 = .5^4 = .0625

While

p(H|THT) = .5

This is because each coin toss is independent.
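The difference between the two quantities shows up clearly if we list all 16 sequences of four fair coin tosses. A minimal sketch, with variable names chosen just for this example:

```python
from fractions import Fraction
from itertools import product

# Every sequence of four tosses; all 16 are equally likely for a fair coin
omega = list(product("HT", repeat=4))

# Joint probability of the whole series T, H, T, H
target = tuple("THTH")
p_thth = Fraction(sum(1 for o in omega if o == target), len(omega))

# Conditional probability of H on toss 4, given T, H, T on tosses 1 to 3
given = [o for o in omega if o[:3] == tuple("THT")]
p_h_given_tht = Fraction(sum(1 for o in given if o[3] == "H"), len(given))

print(p_thth)          # 1/16
print(p_h_given_tht)   # 1/2
```

1/16 is .0625: the joint probability covers the whole series, while the conditional one only asks about the last toss.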

However, when we are dealing with the joint probabilities of dependent events, things are a bit different.

Say we have developed a new Alzheimer’s screening and we know some things about it. “Positive” means the respondent got a result indicating that he or she has the disease.

A = in reality the subject actually has Alzheimer’s

Ac = the subject does not have Alzheimer’s (Ac is read “not A”)

B = the subject tests positive

(B|A) = a true positive; the subject has Alzheimer’s and gets a positive test result

(B|Ac) = a false positive; positive test result, but the subject does not have Alzheimer’s

Research has given us the probabilities for the above outcomes:

p(B|A) = .75 (true positive)

p(B|Ac) = .15 (false positive)

Now, someone walks in off the street, takes the test and gets a positive result. How likely is it that he actually has the disease?

Thomas Bayes

Born 1702, died 1761. Until 1752 he was a Presbyterian minister. He got interested in probability and began playing around with the kind of numbers I just introduced to you. His work, although really important, was not published until 1764. His innovation was to allow us to incorporate prior information (namely probabilities) into our probability calculations.

What sort of prior probabilities? Primarily base rates. In the example above, the base rate is the rate of the disease in the population. Or the proportion of the population that has Alzheimer’s. As we already know, this proportion is the same as the odds that some randomly chosen individual will have Alzheimer’s.

Remember this:

p(A ∩ B) = p(A) p(B)

for conditional (or non-independent) probabilities it’s:

p(A ∩ B) = p(A) p(B|A)

If we do some algebraic substitution and fiddling we get (trust me)

p(A|B) = p(B|A) p(A) / [p(B|A) p(A) + p(B|Ac) p(Ac)]

To use this we need one piece of information that we don’t yet have: the rate of the disease in the population, p(A).

p(A) = .05; 5% of the population has Alzheimer’s.

So,

p(A|B) = (.75)(.05) / [(.75)(.05) + (.15)(.95)] = .0375 / .18 ≈ .21

So, if you get a positive result, there is only a 21% chance you actually have the disease.

Question: Is this the correct base rate to use?

What if we restrict ourselves to a geriatric population? The base rate for them is .5.
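One way to explore the base-rate question is to put Bayes' rule, p(A|B) = p(B|A) p(A) / [p(B|A) p(A) + p(B|Ac) p(Ac)], into a small function and try both base rates. This is only a sketch; the function name posterior and its parameter names are made up for illustration.

```python
def posterior(p_b_given_a, p_b_given_not_a, p_a):
    """p(A|B) from the true-positive rate, false-positive rate, and base rate."""
    true_pos = p_b_given_a * p_a                # has the disease and tests positive
    false_pos = p_b_given_not_a * (1 - p_a)     # healthy but tests positive
    return true_pos / (true_pos + false_pos)

general = posterior(0.75, 0.15, 0.05)    # base rate for the whole population
geriatric = posterior(0.75, 0.15, 0.5)   # base rate for a geriatric population

print(round(general, 3))     # 0.208
print(round(geriatric, 3))   # 0.833
```

Same test, same positive result, very different conclusions: the base rate does a lot of the work.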

This is very useful. The problem is you have to understand something about probabilities to use it. This would be fine if only mathematicians needed to use this. But, as I hope the above example illustrates, this is very important for everyday people to be able to use. Gerd Gigerenzer in his most recent book describes his method of informal frequencies that makes these calculations almost trivial.

p(B|A) = .75 p(B|Ac) = .15 p(A) = .05

Think of 10,000 people.

Of those, how many will test positive?

1. 500 should have the disease.

2. Of those, (.75)(500) = 375 will test positive.

3. Of the remaining, (.15)(9500) = 1425 will test positive.

4. 375 + 1425 = 1800 will test positive.

5. Of those 1800, only 375 actually have the disease: 375/1800 ≈ .21.
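The frequency steps above translate almost line for line into code. A minimal sketch using whole-number counts, with a population of 10,000 and the rates given earlier (variable names are illustrative):

```python
population = 10_000

sick = population * 5 // 100             # 5% base rate -> 500
healthy = population - sick              # 9500

true_positives = sick * 75 // 100        # 75% of the sick test positive -> 375
false_positives = healthy * 15 // 100    # 15% of the healthy test positive -> 1425
all_positives = true_positives + false_positives   # 1800

# Chance that someone with a positive result actually has the disease
share_sick = true_positives / all_positives
print(all_positives, round(share_sick, 3))   # 1800 0.208
```

No formula is needed: the answer is just the sick positives divided by all positives.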

Drawing a decision tree makes the process even easier.

Homework: assignment 2 on the webpage
